This document walks through the code used to obtain the results of my project. The section include:
Data Cleaning Visualizations PCA and Hierarchical CLustering Model Building and Diagnostics
The R commands for this project will be supplied as needed. First, the following libraries and functions will need to be loaded.
# rm(list = ls())
setwd("/Volumes/Macintosh HD - Data/SCHOOL/WGU/CapstoneProject") ##set your path
library(tidycensus)
library(dplyr)
library(tidyverse)
library(tigris)
library(leaflet)
library(stringr)
library(sf)
library(purrr)
library(zipcode)
library(stringi)
library(ggplot2)
library(devtools)
library(tmap)
library(tmaptools)
library(FactoMineR)
library(tm)
library(stats)
library(openintro)
library(missMDA)
library(pscl)
library(factoextra)
library(openintro)
library(missMDA)
library(devtools)
library(PerformanceAnalytics)
library(ggrepel)
library(scales)
library(conflicted)
library(ResourceSelection)
library(caret)
library(MASS)
library(mpath)
library(DataExplorer)
library(corrplot)
library(knitr)
## set conflicts
conflict_prefer("select", "dplyr")
[conflicted] Removing existing preference
[conflicted] Will prefer [34mdplyr::select[39m over any other package
conflict_prefer("filter", "dplyr")
[conflicted] Removing existing preference
[conflicted] Will prefer [34mdplyr::filter[39m over any other package
## custom functions
medianWithoutNA = function(x) {
median(x[which(!is.na(x))])
}
add_cols <- function(.data, ..., .f = sum){
tmp <- dplyr::select_at(.data, dplyr::vars(...))
purrr::pmap_dbl(tmp, .f = .f)
} ## great function to sum up multiple columns from https://github.com/tidyverse/dplyr/issues/4544
The farmers market, election, USDA, and zip code data files are uploaded. Because most of these files use state and county codes, such as “003,” we set the appropriate columns to character columns and add leading 0’s where necessary. Additionally we are adding the FIPS code to the USDA data, which is a concatonation of the sate and county codes.
The farmers market data can be found here: https://www.kaggle.com/madeleineferguson/farmers-markets-in-the-united-states?select=farmers_markets_from_usda.csv
The USDA data was obtained using the QuickStat filters at: https://quickstats.nass.usda.gov/ Data was collected at the county level for te 2017 estimates.
farmers_markets = read.csv("Raw Data/farmers_markets_from_usda.csv") %>% mutate_if(is.factor, as.character)
Election2016 = read.csv("Raw Data/Election2016byCounty.csv",
colClasses = c(Fips = "character")) %>% mutate_if(is.factor, as.character)
zips_in_county_subdiv = read.table("Raw Data/Zip Code Referential Table.txt", header = TRUE, sep = ",",
colClasses = c(ZCTA5 = "character",
STATE = "character", COUNTY = "character", COUSUB = "character",
GEOID = "character", CLASSFP = "character"))
animal_total_sales = read.csv("Raw Data/USDA DATA/AnimalTotalByCounty.csv",
colClasses = c(State.ANSI = "character", County.ANSI = "character" )) %>%
mutate(State.ANSI = ifelse(nchar(State.ANSI) == 1, paste0("0", State.ANSI), State.ANSI),
County.ANSI = ifelse(nchar(County.ANSI) == 1, paste0("0", County.ANSI), County.ANSI),
County.ANSI = ifelse(nchar(County.ANSI) == 2, paste0("0", County.ANSI), County.ANSI),
Fips = paste0(State.ANSI,County.ANSI))
crop_total_sales = read.csv("Raw Data/USDA DATA/CropTotalsSalesDollars.csv",
colClasses = c(State.ANSI = "character", County.ANSI = "character" )) %>%
mutate(State.ANSI = ifelse(nchar(State.ANSI) == 1, paste0("0", State.ANSI), State.ANSI),
County.ANSI = ifelse(nchar(County.ANSI) == 1, paste0("0", County.ANSI), County.ANSI),
County.ANSI = ifelse(nchar(County.ANSI) == 2, paste0("0", County.ANSI), County.ANSI),
Fips = paste0(State.ANSI,County.ANSI))
fruit_and_nuts_total_sales = read.csv("Raw Data/USDA DATA/FruitNutsSAlesDollars.csv",
colClasses = c(State.ANSI = "character", County.ANSI = "character" )) %>%
mutate(State.ANSI = ifelse(nchar(State.ANSI) == 1, paste0("0", State.ANSI), State.ANSI),
County.ANSI = ifelse(nchar(County.ANSI) == 1, paste0("0", County.ANSI), County.ANSI),
County.ANSI = ifelse(nchar(County.ANSI) == 2, paste0("0", County.ANSI), County.ANSI),
Fips = paste0(State.ANSI,County.ANSI))
veggie_total_sales = read.csv("Raw Data/USDA DATA/VeggieTotalSalesDollars.csv",
colClasses = c(State.ANSI = "character", County.ANSI = "character" )) %>%
mutate(State.ANSI = ifelse(nchar(State.ANSI) == 1, paste0("0", State.ANSI), State.ANSI),
County.ANSI = ifelse(nchar(County.ANSI) == 1, paste0("0", County.ANSI), County.ANSI),
County.ANSI = ifelse(nchar(County.ANSI) == 2, paste0("0", County.ANSI), County.ANSI),
Fips = paste0(State.ANSI,County.ANSI))
AG_land = read.csv("Raw Data/USDA DATA/AGLand.csv",
colClasses = c(State.ANSI = "character", County.ANSI = "character" )) %>%
mutate(State.ANSI = ifelse(nchar(State.ANSI) == 1, paste0("0", State.ANSI), State.ANSI),
County.ANSI = ifelse(nchar(County.ANSI) == 1, paste0("0", County.ANSI), County.ANSI),
County.ANSI = ifelse(nchar(County.ANSI) == 2, paste0("0", County.ANSI), County.ANSI),
Fips = paste0(State.ANSI,County.ANSI))
The goal will be to create a frequency table with the farmers market data with the number of farmers markets per county, along with cleaning and matching incomplete entries. The USDA data can be combined into one data set. The zip code data will be used as a reference, assigning a given zip code to the county where the largest portion of residents live (zip codes can span more than one county). The election data set is complete. All of these tables will be joined by using the FIPS code as a primary key.
FIPS codes will be used to join all six tables, which contain the values of the given statistic along with its CV. Alaska will not be included.
Next the six tables are joined together using the Fips code as a primary key. Alaska, state code 02, is being fitered out.
U1 = animal_total_sales %>% dplyr::select(Fips, State, County, State.ANSI,County.ANSI, Ag.District, Value, CV....) %>% rename( AnimalSales = Value, AnimalCV = CV....) %>% filter(State.ANSI != "02") %>% left_join(crop_total_sales %>% filter(Year == 2017) %>% select(Fips, Value, CV....) %>% rename(CropSales = Value, CropCV = CV....), by = "Fips") %>% left_join(fruit_and_nuts_total_sales %>% filter(Year == 2017) %>% select(Fips, Value, CV....) %>% rename(FruitNutSales = Value, FruitNutCV = CV....), by = "Fips")%>% left_join(veggie_total_sales %>% filter(Year == 2017) %>% select(Fips, Value, CV....) %>% rename(VeggieSales = Value, VeggieCV = CV....), by = "Fips") %>% left_join(AG_land %>% filter(Year == 2017) %>% select(Fips, Value, CV....) %>% rename(AgLandAcres = Value, AgLandCV = CV....), by = "Fips") %>% filter(!State == "02")
head(U1)
str(U1)
'data.frame': 3068 obs. of 16 variables:
$ Fips : chr "01001" "01011" "01047" "01051" ...
$ State : chr "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" ...
$ County : chr "AUTAUGA" "BULLOCK" "DALLAS" "ELMORE" ...
$ State.ANSI : chr "01" "01" "01" "01" ...
$ County.ANSI : chr "001" "011" "047" "051" ...
$ Ag.District : chr "BLACK BELT" "BLACK BELT" "BLACK BELT" "BLACK BELT" ...
$ AnimalSales : chr "8925000.00" " (D)" "36114000.00" "10212000.00" ...
$ AnimalCV : chr "12.70" "(D)" "12.70" "12.70" ...
$ CropSales : chr "12535000.00" " (D)" "27886000.00" "17382000.00" ...
$ CropCV : chr "16.30" "(D)" "16.30" "16.30" ...
$ FruitNutSales: chr " (D)" "396,000" "204,000" "167,000" ...
$ FruitNutCV : chr "(D)" "30.9" "30.9" "30.9" ...
$ VeggieSales : chr "2,523,000" "121,000" "1,148,000" "379,000" ...
$ VeggieCV : chr "32.8" "32.8" "32.8" "32.8" ...
$ AgLandAcres : chr "13,211" "3,359" "38,177" "6,589" ...
$ AgLandCV : chr "22.4" "22.4" "22.4" "22.4" ...
plot_missing(U1)
Using the NASS glossary, the code (D) is for disclosed. (H) refers to high CV value, over 99.95%, and (Z) refers to almost 0. This coding is causing numeric variables to be interpreted as a character. Let’s see what’s missing.
tot_na1 = apply(U1, 2, function(x) length(which(is.na(x))))
tot_na1[tot_na1 > 0]
CropSales CropCV FruitNutSales FruitNutCV VeggieSales VeggieCV AgLandAcres AgLandCV
4 4 328 328 254 254 36 36
The missing or disclosed numeric variables are removed and the columns are formatted to be numeric variables. Drop FruitNutSales, FruitNutCV, VeggieSales, and VeggieCV due to missingness.
## (D) = disclosed, (H) = high, 99.95%+, (Z) = almost 0 (from NASS glossary)
## remove commas, change to numeric variables
w = 7:16 ## column indices of variables to be imputed
U1 = cbind(U1[ ,-w], apply(U1[ ,w], 2, trimws))
U1 = cbind(U1[ ,-w], apply(U1[ ,w], 2, function(y) gsub("(D)", NA, y)))
U1 = cbind(U1[ ,-w], apply(U1[ ,w], 2, function(y) gsub("(Z)", "0", y)))
U1 = cbind(U1[ ,-w], apply(U1[ ,w], 2, function(y) gsub("(H)", "99.95", y)))
U1 = cbind(U1[ ,-w], apply(U1[ ,w], 2, function(y) gsub(",", "", y)))
U1 = cbind(U1[ ,-w], apply(U1[ ,w], 2, as.numeric))
NAs introduced by coercionNAs introduced by coercionNAs introduced by coercionNAs introduced by coercion
U1 = cbind(U1[ ,-w], apply(U1[ ,w], 2, as.numeric))
Missing and disclosed values will be imputed by using the agricultural district medians. Drop
U2 = U1 %>% group_by(Ag.District) %>% filter(!is.na(State)) %>%
mutate(AnimalSales = ifelse(is.na(AnimalSales), medianWithoutNA(AnimalSales),
AnimalSales),
AnimalCV = ifelse(is.na(AnimalCV), medianWithoutNA(AnimalCV), AnimalCV),
FruitNutSales = ifelse(is.na(FruitNutSales ), medianWithoutNA(FruitNutSales ),
FruitNutSales ),
FruitNutCV = ifelse(is.na(FruitNutCV), medianWithoutNA(FruitNutCV), FruitNutCV),
VeggieSales = ifelse(is.na(VeggieSales), medianWithoutNA(VeggieSales),
VeggieSales),
VeggieCV = ifelse(is.na(VeggieCV), medianWithoutNA(VeggieCV), VeggieCV),
CropSales = ifelse(is.na(CropSales), medianWithoutNA(CropSales), CropSales),
CropCV = ifelse(is.na(CropCV ), medianWithoutNA(CropCV ), CropCV ),
AgLandAcres = ifelse(is.na(AgLandAcres), medianWithoutNA(AgLandAcres), AgLandAcres),
AgLandCV = ifelse(is.na(AgLandCV), medianWithoutNA(AgLandCV), AgLandCV)
) %>% ungroup()
tot_na2 = apply(U2, 2, function(x) length(which(is.na(x))))
tot_na2
Fips State County State.ANSI County.ANSI Ag.District AnimalSales AnimalCV
0 0 0 0 0 0 0 0
CropSales CropCV FruitNutSales FruitNutCV VeggieSales VeggieCV AgLandAcres AgLandCV
0 0 31 16 12 12 0 0
The remaining NA’s will be imputed using state medians.
U3 = U2 %>% group_by(State) %>%
mutate(AnimalSales = ifelse(is.na(AnimalSales), medianWithoutNA(AnimalSales), AnimalSales),
AnimalCV = ifelse(is.na(AnimalCV), medianWithoutNA(AnimalCV), AnimalCV),
FruitNutSales = ifelse(is.na(FruitNutSales ), medianWithoutNA(FruitNutSales ), FruitNutSales ),
FruitNutCV = ifelse(is.na(FruitNutCV), medianWithoutNA(FruitNutCV), FruitNutCV),
VeggieSales = ifelse(is.na(VeggieSales), medianWithoutNA(VeggieSales), VeggieSales),
VeggieCV = ifelse(is.na(VeggieCV), medianWithoutNA(VeggieCV), VeggieCV),
CropSales = ifelse(is.na(CropSales), medianWithoutNA(CropSales), CropSales),
CropCV = ifelse(is.na(CropCV ), medianWithoutNA(CropCV ), CropCV ),
AgLandAcres = ifelse(is.na(AgLandAcres), medianWithoutNA(AgLandAcres), AgLandAcres),
AgLandCV = ifelse(is.na(AgLandCV), medianWithoutNA(AgLandCV), AgLandCV)
) %>% ungroup()
tot_na3 = apply(U3, 2, function(x) length(which(is.na(x))))
tot_na3
Fips State County State.ANSI County.ANSI Ag.District AnimalSales
0 0 0 0 0 0 0
AnimalCV CropSales CropCV FruitNutSales FruitNutCV VeggieSales VeggieCV
0 0 0 0 0 0 0
AgLandAcres AgLandCV
0 0
Now we write the csv file to be used again later. It should be noted that this data set uses the current set of FIPS codes, which changed in 2015. There were a few changes to the codes, particularly in the state of Virginia.
USDA_data_cleaned = U3
write.csv(U3, "Cleaned Data/USDA_data.csv")
This data set contains many useful pieces of information compiled through several sources. (https://github.com/Deleetdk/USA.county.data).
dim(Election2016)
[1] 3143 159
str(Election2016[ ,1:25])
'data.frame': 3143 obs. of 25 variables:
$ State : chr "Minnesota" "Kansas" "Oklahoma" "Montana" ...
$ ST : chr "MN" "KS" "OK" "MT" ...
$ Fips : chr "27017" "20127" "40107" "30085" ...
$ County : chr "Carlton County, Minnesota" "Morris County, Kansas" "Okfuskee County, Oklahoma" "Roosevelt County, Montana" ...
$ Precincts : int 38 16 13 12 812 24 12 39 NA 12 ...
$ Votes : int 18059 2568 3933 3502 320164 15177 11317 31094 NA 6367 ...
$ Democrats.08..Votes. : int 11501 907 1480 2564 207371 7127 2248 17940 NA 1913 ...
$ Democrats.12..Votes. : int 11389 718 1256 2086 193501 6921 1789 16330 NA 1733 ...
$ Republicans.08..Votes. : int 6549 1875 2643 1473 144262 7817 8658 12863 NA 4153 ...
$ Republicans.12..Votes. : int 6586 1773 2335 1514 133362 7973 8446 11996 NA 3911 ...
$ Republicans.2016 : num 45.2 69.7 71 49.2 40.3 ...
$ Democrats.2016 : num 46.8 22.8 24 42.9 54.4 ...
$ Green.2016 : num 1.29 2.34 NA 2.57 1.55 ...
$ Libertarians.2016 : num 3.89 5.14 5.06 4.85 3.83 ...
$ Republicans.2012 : num 35.7 69.2 65 41.2 40 ...
$ Republicans.2008 : num 35.5 66 64.1 35.5 40.5 ...
$ Democrats.2012 : num 61.8 28 35 56.8 58 ...
$ Democrats.2008 : num 62.3 31.9 35.9 61.7 58.2 ...
$ Less.Than.High.School.Diploma: num 9.7 9.9 21.2 10.9 11.6 24.8 30.4 12.5 35.9 23.2 ...
$ At.Least.High.School.Diploma : num 90.3 90.1 78.8 89.1 88.4 75.2 69.6 87.5 64.1 76.8 ...
$ At.Least.Bachelors.s.Degree : num 21.4 16.6 10.9 17.3 34.8 13.4 11 18.5 9.6 12.1 ...
$ Graduate.Degree : num 7.2 7.7 3.1 4.7 15.1 4.8 3.2 6.2 0.5 4.9 ...
$ School.Enrollment : num 76.2 74.8 73.5 74.3 81.5 ...
$ Median.Earnings.2010 : num 30427 25342 22072 27895 29747 ...
$ White..Not.Latino..Population: num 89.5 94.2 64 36 74.2 ...
length(unique(Election2016$County))
[1] 3143
Some information is redundant. We will keep the percentages of votes by political party, and exclude the columns that contain the number of votes per party.
E1 = Election2016 %>% filter(ST != "AK") %>% dplyr::select(State:Votes, Republicans.2016:Autumn.Tmin, temp, precip)
Checking NA’s.
tot_na1 = apply(E1, 2, function(x) length(which(is.na(x))))
tot_na1[tot_na1 > 0]
Precincts
3
Votes
3
Republicans.2016
3
Democrats.2016
3
Green.2016
513
Libertarians.2016
3
Republicans.2012
2
Republicans.2008
2
Democrats.2012
2
Democrats.2008
2
Preschool.Enrollment.Ratio.enrolled.ages.3.and.4
8
Child.Poverty.living.in.families.below.the.poverty.line
1
Poor.physical.health.days
333
Poor.mental.health.days
551
Low.birthweight
215
Teen.births
91
Children.in.single.parent.households
2
Adult.smoking
429
Sexually.transmitted.infections
182
HIV.prevalence.rate
803
Uninsured
1
Unemployment
1
Violent.crime
168
Homicide.rate
1868
Injury.deaths
286
Infant.mortality
1706
CA
42
S
10
MAR
10
CFS
10
ACFS
10
Mean.Alc
10
Max.Alc
10
Mixedness
10
elevation
227
Annual.Prcp
511
Winter.Prcp
511
Summer.Prcp
511
Spring.Prcp
511
Autumn.Prcp
511
Annual.Tavg
1392
Annual.Tmax
1392
Annual.Tmin
1392
Winter.Tavg
1392
Winter.Tmax
1392
Winter.Tmin
1392
Summer.Tavg
1392
Summer.Tmax
1392
Summer.Tmin
1392
Spring.Tavg
1392
Spring.Tmax
1392
Spring.Tmin
1392
Autumn.Tavg
1392
Autumn.Tmax
1392
Autumn.Tmin
1392
temp
1392
precip
511
Homicide rates and infant mortality have too many NA’s so it will be dropped. Otherwise state medians will be used to impute the remaining missing values. There are many missing temperatures but these will be imputed by state medians, which should be fairly representative of the county.
E2 = E1 %>% dplyr::select(-Homicide.rate, -Infant.mortality ) %>%
group_by(State) %>%
mutate_each(funs(ifelse(is.na(.),median(., na.rm = TRUE),.))) %>% ungroup()
Checking NA’s.
tot_na2 = apply(E2, 2, function(x) length(which(is.na(x))))
tot_na2[tot_na2 > 0]
Green.2016 HIV.prevalence.rate CA Annual.Prcp Winter.Prcp
511 119 6 5 5
Summer.Prcp Spring.Prcp Autumn.Prcp Annual.Tavg Annual.Tmax
5 5 5 5 5
Annual.Tmin Winter.Tavg Winter.Tmax Winter.Tmin Summer.Tavg
5 5 5 5 5
Summer.Tmax Summer.Tmin Spring.Tavg Spring.Tmax Spring.Tmin
5 5 5 5 5
Autumn.Tavg Autumn.Tmax Autumn.Tmin temp precip
5 5 5 5 5
Remaining missing will be imputed by global column medians.
E3 = E2 %>%
mutate_each(funs(ifelse(is.na(.),median(., na.rm = TRUE),.)))
Checking NA’s.
tot_na3 = apply(E3, 2, function(x) length(which(is.na(x))))
tot_na3[tot_na3 > 0]
named integer(0)
Good to go!
Election2016Cleaned = E3
write.csv(E3, "Cleaned Data/Election2016Cleaned.csv")
Use tidycensus to link up with census API.
First load avaiable variables
census_variables = load_variables(year = 2010, dataset = "sf1", cache = TRUE)
I would like the following variables: family to household ratio = P018002/P018001 married household to family household ratio = P018003/P018002 urban ratio = H002002 / H002001 renter occupied = H004004 / H004001 household size categories = H013002:H013008 / H013001 Ave househould size = H012001 Ave family size = P037A001 total female pop = P012026/ P012001 male age groups = (P012003:7 (under 20)), (P012008:10 (20-24)) , (P012011:14( 25-44)), (P012015:6 (45-54)), (P012017 (55-59)), (P012018:25 (60+)) / P012002 (total) female age groups = (P0120273:31 (under 20)), (P012032:34 (20-24)) , (P012038:38( 25-44)), (P012039:40 (45-54)), (P012041 (55-59)), (P012042:49 (60+)) / P012026 (total) husband - wife - children families ratio = P038003/ P038001
vars = c(paste0("P01800",1:3), "H002001", "H002002", "H004001" , "H004004", "H012001",
paste0("H01300", 1:8), "P037A001", paste0("P01200", 1:9), paste0("P0120", 10:49),
"P038001", "P038003")
Now let’s turn this raw data into the ratios I’m after and drop the raw numbers. These are the first of my derived variables.
Census_data_cleaned = census_raw_data %>% rename(Fips = GEOID) %>%
mutate(FamilyRatio = P018002/P018001,
MarriedHouseholdRatio = P018003/P018002,
RenterOccupied = H004004/H004001,
AveHousehouldSize = H012001,
AveFamilySize = P037A001,
TotalFemale = P012026/ P012001,
MaleUnder20 = add_cols(., P012003:P012007) / P012002,
Male20to24 = add_cols(.,P012008:P012010) / P012002,
Male25to44 = add_cols(.,P012011:P012014) / P012002,
Male44to54 = add_cols(.,P012015:P012016) / P012002,
Male55to59 = P012017 / P012002,
MaleOver59 = add_cols(.,P012018:P012025) / P012026,
FemaleUnder20 = add_cols(.,P012027:P012031) / P012026,
Female20to24 = add_cols(.,P012032:P012034) / P012026,
Female25to44 = add_cols(.,P012035:P012038) / P012026,
Female44to54 = add_cols(.,P012039:P012040) / P012026,
Female55to59 = P012041 / P012026,
FemaleOver59 = add_cols(.,P012042:P012049) / P012026,
HusbandWifeFamilyRatio = P038003/ P038001
) %>% dplyr::select(-all_of(vars))
Just like a mirror -looking good!
def’s for the census data https://www.census.gov/programs-surveys/cps/technical-documentation/subject-definitions.html#household
The farmers market table contains some incomplete records. The county of each farmers market is needed but missing or unmatched on the farmers market table. Additionally, there is no FIPS code provided on the table. The purpose of this section have a reference so a zip code can be matched to its county. However, some zip code span more than one county. To resolve this, it has been decided to match each zip code with the county where the most amount of residents from that zip code.
A copy of a zip code to county subdivision reference table is available through the US Census Bureau.
str(zips_in_county_subdiv)
'data.frame': 107996 obs. of 26 variables:
$ ZCTA5 : chr "00601" "00601" "00601" "00601" ...
$ STATE : chr "72" "72" "72" "72" ...
$ COUNTY : chr "001" "001" "001" "001" ...
$ COUSUB : chr "00401" "13645" "30458" "32049" ...
$ CLASSFP : chr "Z1" "Z1" "Z1" "Z1" ...
$ GEOID : chr "7200100401" "7200113645" "7200130458" "7200132049" ...
$ POPPT : int 4406 1038 1337 140 254 1176 865 276 504 606 ...
$ HUPT : int 1968 425 509 60 115 461 347 128 187 252 ...
$ AREAPT : num 1942319 9420707 16497991 7312819 2763743 ...
$ AREALANDPT : num 1942319 9387179 16271520 6974412 2763743 ...
$ ZPOP : int 18570 18570 18570 18570 18570 18570 18570 18570 18570 18570 ...
$ ZHU : int 7744 7744 7744 7744 7744 7744 7744 7744 7744 7744 ...
$ ZAREA : num 1.67e+08 1.67e+08 1.67e+08 1.67e+08 1.67e+08 ...
$ ZAREALAND : num 1.67e+08 1.67e+08 1.67e+08 1.67e+08 1.67e+08 ...
$ CSPOP : int 4406 1054 1337 140 853 1176 865 276 553 606 ...
$ CSHU : int 1968 434 509 60 369 461 347 128 207 252 ...
$ CSAREA : num 1942319 9494851 16497991 7312952 7695788 ...
$ CSAREALAND : num 1942319 9461323 16271520 6974412 7443424 ...
$ ZPOPPCT : num 23.73 5.59 7.2 0.75 1.37 ...
$ ZHUPCT : num 25.41 5.49 6.57 0.77 1.49 ...
$ ZAREAPCT : num 1.16 5.63 9.85 4.37 1.65 ...
$ ZAREALANDPCT : num 1.17 5.63 9.76 4.18 1.66 ...
$ CSPOPPCT : num 100 98.5 100 100 29.8 ...
$ CSHUPCT : num 100 97.9 100 100 31.2 ...
$ CSAREAPCT : num 100 99.2 100 100 35.9 ...
$ CSAREALANDPCT: num 100 99.2 100 100 37.1 ...
This table contains data on the population, number of housing units, land area for section of the zip code contained in each county. The “CS” prefix refers to the county sibdivision of that zip code. A county is composed of multiple subdivisions, and we can use that fact to calculate the total land area and housing units of each county given this table.
Z1 = zips_in_county_subdiv %>% group_by(STATE, COUNTY) %>%
mutate(CountyHU = sum(CSHU), CountyArea = sum(CSAREA)) %>% ungroup()
The zip sections are arranged by descending population. FIPS codes are added and only the most popluous section of each unique zip code is selected.
Z2 = Z1 %>% arrange(ZCTA5, desc(POPPT)) %>% mutate(Fips = paste0(STATE, COUNTY)) %>% select(-CLASSFP)##sort by zip and population
zips = unique(zips_in_county_subdiv$ZCTA5) ## keep only the first unique zip
Z3 = Z2[!duplicated(Z2$ZCTA5), ] #clean
The tidyCENSUS package contians the county names corresponding to each FIPS code. We create a reference table.
County_Code_Ref = fips_codes %>% mutate(Fips = paste0(state_code, county_code))
head(County_Code_Ref)
Lastly we join the Z3 table with the County_Code_Ref table to include county names.
Z4 = left_join(Z3, County_Code_Ref %>% dplyr::select(Fips, county, state_name, state), by = "Fips") %>% rename(ST = state, Zip = ZCTA5)
Lastly rearrange to be more user-friendly.
zip_unique = Z4 %>% dplyr::select(Zip, Fips, county, state_name, ST, STATE:CountyArea)
write.csv(zip_unique, "Cleaned Data/Zip Unique.csv")
A frequency of the number of farmers market per county will be created. Missing or incomplete county names will need to be imputed by zip code. If a zip code is missing, the zipcode package can match cities to zip codes, although some cities span more than one zip code.
dim(farmers_markets)
[1] 8804 59
str(farmers_markets[ ,1:30])
'data.frame': 8804 obs. of 30 variables:
$ FMID : int 1018261 1018318 1009364 1010691 1002454 1011100 1009845 1005586 1008071 1012710 ...
$ MarketName : chr " Caledonia Farmers Market Association - Danville" " Stearns Homestead Farmers' Market" "106 S. Main Street Farmers Market" "10th Steet Community Farmers Market" ...
$ Website : chr "https://sites.google.com/site/caledoniafarmersmarket/" "http://www.StearnsHomestead.com" "http://thetownofsixmile.wordpress.com/" "" ...
$ Facebook : chr "https://www.facebook.com/Danville.VT.Farmers.Market/" "StearnsHomesteadFarmersMarket" "" "" ...
$ Twitter : chr "" "" "" "" ...
$ Youtube : chr "" "" "" "" ...
$ OtherMedia : chr "" "" "" "http://agrimissouri.com/mo-grown/grodetail.php?type=mo-grown&ID=275" ...
$ street : chr "" "6975 Ridge Road" "106 S. Main Street" "10th Street and Poplar" ...
$ city : chr "Danville" "Parma " "Six Mile" "Lamar " ...
$ County : chr "Caledonia" "Cuyahoga" "Pickens" "Barton" ...
$ State : chr "Vermont" "Ohio" "South Carolina" "Missouri" ...
$ zip : chr "5828" "" "29682" "64759" ...
$ Season1Date: chr "06/14/2017 to 08/30/2017" "06/24/2017 to 09/30/2017" "" "04/02/2014 to 11/30/2014" ...
$ Season1Time: chr "Wed: 9:00 AM-1:00 PM;" "Sat: 9:00 AM-1:00 PM;" "" "Wed: 3:00 PM-6:00 PM;Sat: 8:00 AM-1:00 PM;" ...
$ Season2Date: chr "09/06/2017 to 10/18/2017" "" "" "" ...
$ Season2Time: chr "Wed: 2:00 PM-6:00 PM;" "" "" "" ...
$ Season3Date: chr "" "" "" "" ...
$ Season3Time: chr "" "" "" "" ...
$ Season4Date: chr "" "" "" "" ...
$ Season4Time: chr "" "" "" "" ...
$ x : num -72.1 -81.7 -82.8 -94.3 -73.9 ...
$ y : num 44.4 41.4 34.8 37.5 40.8 ...
$ Location : chr "" "" "" "" ...
$ Credit : chr "Y" "Y" "Y" "Y" ...
$ WIC : chr "Y" "N" "N" "N" ...
$ WICcash : chr "N" "N" "N" "N" ...
$ SFMNP : chr "Y" "Y" "N" "N" ...
$ SNAP : chr "N" "N" "N" "N" ...
$ Organic : chr "Y" "-" "-" "-" ...
$ Bakedgoods : chr "Y" "Y" "" "Y" ...
Inspecting the zip codes.
table(nchar(farmers_markets$zip))
0 1 2 3 4 5 6 10
947 1 10 41 843 6950 2 10
sum(is.na(farmers_markets$zip))
[1] 0
Some zip code are missing leading 0’s, some have a hyphenated suffix, some have mistakes, and some are missing. We add an ID column and add leading 0’s to 3 and 4 digit zips. Additionally, we need the county names to match up correctly in order to locate the FIPS code. So we trim whitespace, fix capitalization, and add Parish to county names in Louisiana.
farmers_markets = cbind( ID = 1:nrow(farmers_markets), farmers_markets)
F1 = farmers_markets %>% mutate(City = str_to_title(trimws(city, which = "both")),
State = str_to_title(trimws(State)),
County = str_to_title(County),
County = str_replace(County, "County",""),
County = trimws(County, which = "both"),
ST = state2abbr(State),
zip = ifelse(nchar(zip) == 3, paste0("0",zip), zip),
zip = ifelse(nchar(zip) == 4, paste0("0",zip), zip),
zip = ifelse(nchar(zip) == 10, str_sub(zip, 1, 5), zip),
County = ifelse(State == "LA", paste0(County, "Parish"),
paste(County, "County", sep =" "))) %>%
select(ID, City, County, State, ST, zip)
Next we use the zipcode package to find zip codes using city and state names if a remainig zip code does not have five digits.
data("zipcode")
F2 = F1 %>% mutate(zip = ifelse(nchar(zip == 5), zip,
zipcode$zip[zipcode$city == City & zipcode$state == ST] ),
County = ifelse(ST == "DC", str_replace(County, "County", ""), County))
sum(is.na(F2$County))
[1] 42
There are stll a few missing counties, which we will deal with later.
F3 = left_join(F2, County_Code_Ref, by = c(c("County"="county"), c("ST" = "state")), all.x=T) %>%
mutate(Fips = ifelse(ST == "DC", "11001", Fips)) %>%
select(-state_name)
Now we drop our first observations outside the lower 49 states and find out how many FIPS codes are still missing.
dropped_IDs = F3$ID[F3$State %in% c("Alaska", "Puerto Rico", "Virgin Islands")] ## create running vector of dropped ID's
F4 = F3 %>% filter(!State %in% c("Alaska", "Puerto Rico", "Virgin Islands"))
sum(is.na(F4$Fips))
[1] 326
We did drop 83 farmers markets, but have 326 FIPS codes to find. We put these observations into a new data frame. Let’s see if there zip codes seem reliable.
na_county = F4[is.na(F4$Fips), c(1,2,3,6)]
table(nchar(na_county$zip))
0 2 5
44 2 280
Most of these missing do have zip codes so let’s see if we can find the counties using the zip code reference table.
found_county = left_join(na_county, zip_unique, by = c("zip" = "Zip" ) ) %>% select(ID:ST)
These found counties will be joined to the running farmers market table.
F5 = left_join(F4, found_county, by = "ID")
F6 = F5 %>% mutate(County = ifelse(is.na(County.x), County.y, County.x),
Fips = ifelse(is.na(Fips.x), Fips.y, Fips.x)) %>%
select(ID, City.x, County, State, ST.x, zip.x, state_code, county_code, Fips )
colnames(F6) = gsub(".x", "", colnames(F6))
sum(is.na(F6$Fips))
[1] 52
Now we’re left with 52 Fips to find.
As many of the last few remaining NA’s are located.
na_county2 = F6[is.na(F6$Fips), ] %>%
mutate(County = str_replace(County, "County", ""),
County = trimws(County, which = "both"),
County = ifelse(ST == "LA", paste(County, "Parish", sep = " "),
paste(County, "County", sep =" ")),
County = str_replace(County, "St.", "Saint"),
County = ifelse(str_sub(County, 1, 2) %in% c("De", "Mc", "O'", "Du"),
paste0(str_sub(County, 1,2), str_to_upper(str_sub(County, 3,3)), str_sub(County, 4,-1)), County),
County = ifelse(City == "Colorado Springs", "El Paso County", County),
County = ifelse(City == "St. Louis", "St. Louis County", County),
County = ifelse(County == "Fond Du Lac County", "Fond du Lac County", County)
)
found_county2 = left_join(na_county2, County_Code_Ref, by = c(c("County" = "county"), c("ST" = "state")))
Join these last counties that were able to be matched to the running farmers market table, and drop the remaining.
dropped_IDs = c(dropped_IDs,found_county2$ID[is.na(found_county2$Fips.y)])
found_county2 = found_county2[!is.na(found_county2$Fips.y),] %>%
select(ID, City, County, State, ST, zip,
Fips.y, state_code.y, county_code.y)
## merge with running farmers market table
F7 = left_join(F6, found_county2, by = "ID") %>%
mutate( Fips = ifelse(is.na(Fips), Fips.y, Fips),
County = ifelse(is.na(Fips), County.y, County.x)) %>%
select(ID, City.x, County.x, State.x, ST.x, zip.x, state_code,
county_code, Fips)
colnames(F7) = gsub(".x","", colnames((F7)))
We create a CSV file containing each farmers market’s FIPS code.
length(dropped_IDs) ## only had to drop 116 of 8,804! (some are mobile markets)
[1] 116
write.csv(F7, "Cleaned Data/FarmersMarketEach.csv") ## each market
Droppped_FM = farmers_markets[dropped_IDs, ] ##data frame of dropped markets
Of our 8,804 original farmers markets, we dropped 116 which were either outside the lower 49 states or we were unable to match the market to one county. It should be noted that some of these markets are “mobile markets.”
Next we write a csv file with each market. This is not the frequency table we’re after, but this is a useful table for later applications.
## lastly make a freq table by each Fips. Only lower 49 states
Fips_table = data.frame(table(F7$Fips)) %>% rename( Fips = Var1) %>% mutate_if(is.factor, as.character)
F8 = left_join(County_Code_Ref %>% filter(!state %in% c("AK","PR","VI")),
Fips_table, by = "Fips", all.x = TRUE) %>%
mutate(Freq = ifelse(is.na(Freq), 0, Freq))
AllCountyFMfreq = F8
write.csv(F8, "Cleaned Data/AllCountyFarmersMarketFreq.csv")
The last step now is to join the Election, USDA, Zip Code, and Farmers Market frequency tables together.
Below we join the election and USDA data to the farmer market frequency table, and inspect any NA’s.
V1 = full_join(AllCountyFMfreq %>% dplyr::select(Fips,county, state, state_name, Freq), USDA_data_cleaned, by = "Fips")
V12 = full_join(V1, Election2016Cleaned, by = "Fips") %>% dplyr::select(-State.x, -County.x, -State.ANSI, -County.ANSI)
V2 = full_join(V12, Census_data_cleaned, by ="Fips")
tot_naF = apply(V2, 2, function(x) length(which(is.na(x))))
tot_naF[tot_naF > 0 ]
Ag.District
58
AnimalSales
58
AnimalCV
58
CropSales
58
CropCV
58
FruitNutSales
58
FruitNutCV
58
VeggieSales
58
VeggieCV
58
AgLandAcres
58
AgLandCV
58
State.y
12
ST
12
County.y
12
Precincts
12
Votes
12
Republicans.2016
12
Democrats.2016
12
Green.2016
12
Libertarians.2016
12
Republicans.2012
12
Republicans.2008
12
Democrats.2012
12
Democrats.2008
12
Less.Than.High.School.Diploma
12
At.Least.High.School.Diploma
12
At.Least.Bachelors.s.Degree
12
Graduate.Degree
12
School.Enrollment
12
Median.Earnings.2010
12
White..Not.Latino..Population
12
African.American.Population
12
Native.American.Population
12
Asian.American.Population
12
Other.Race.or.Races
12
Latino.Population
12
Children.Under.6.Living.in.Poverty
12
Adults.65.and.Older.Living.in.Poverty
12
Total.Population
12
Preschool.Enrollment.Ratio.enrolled.ages.3.and.4
12
Poverty.Rate.below.federal.poverty.threshold
12
Gini.Coefficient
12
Child.Poverty.living.in.families.below.the.poverty.line
12
Management.professional.and.related.occupations
12
Service.occupations
12
Sales.and.office.occupations
12
Farming.fishing.and.forestry.occupations
12
Construction.extraction.maintenance.and.repair.occupations
12
Production.transportation.and.material.moving.occupations
12
White
12
Black
12
Hispanic
12
Asian
12
Amerindian
12
Other
12
White..Asian
12
Sire.Homogeneity
12
Median.Age
12
lon
12
lat
12
Poor.physical.health.days
12
Poor.mental.health.days
12
Low.birthweight
12
Teen.births
12
Children.in.single.parent.households
12
Adult.smoking
12
Adult.obesity
12
Diabetes
12
Sexually.transmitted.infections
12
HIV.prevalence.rate
12
Uninsured
12
Unemployment
12
Violent.crime
12
Injury.deaths
12
CA
12
S
12
MAR
12
CFS
12
ACFS
12
Mean.Alc
12
Max.Alc
12
Mixedness
12
elevation
12
Annual.Prcp
12
Winter.Prcp
12
Summer.Prcp
12
Spring.Prcp
12
Autumn.Prcp
12
Annual.Tavg
12
Annual.Tmax
12
Annual.Tmin
12
Winter.Tavg
12
Winter.Tmax
12
Winter.Tmin
12
Summer.Tavg
12
Summer.Tmax
12
Summer.Tmin
12
Spring.Tavg
12
Spring.Tmax
12
Spring.Tmin
12
Autumn.Tavg
12
Autumn.Tmax
12
Autumn.Tmin
12
temp
12
precip
12
NAME
12
FamilyRatio
12
MarriedHouseholdRatio
12
RenterOccupied
12
AveHousehouldSize
12
AveFamilySize
12
TotalFemale
12
MaleUnder20
12
Male20to24
12
Male25to44
12
Male44to54
12
Male55to59
12
MaleOver59
12
FemaleUnder20
12
Female20to24
12
Female25to44
12
Female44to54
12
Female55to59
12
FemaleOver59
12
HusbandWifeFamilyRatio
12
Here is a view of our last few NA’s.
final_nas1 = V2[is.na(V2$Ag.District), 1:16]
final_nas2 = V2[is.na(V2$MAR), 1:16]
final_nas3 = rbind(final_nas1,final_nas2)
final_nas4 = final_nas3[!duplicated(final_nas3$Fips), ]
final_nas4
The 12 rows missing election data and census data, which all have 0 farmers market frequency, are dropped.
## drop rows with both missing Election2016 data from running final table
V3 = V2 %>% filter( !is.na(MAR)) ## all row is missing and no Freq's lost
dropped_fips = setdiff(V2$Fips, V3$Fips)
The county area and housing unit from the zips table needs to be added.
V4 = left_join(V3, zip_unique %>% dplyr::select(Fips, CountyArea, CountyHU), by = "Fips") %>% distinct() %>% dplyr::select( -state, -County.y, -State.y)
Last use state medians to impute missing USDA data. The state of Virginia has multiple entries imputed.
V5 = V4 %>% group_by(state_name) %>%
mutate(CountyArea = ifelse(is.na(CountyArea), medianWithoutNA(CountyArea), CountyArea),
CountyHU = ifelse(is.na(CountyHU), medianWithoutNA(CountyHU), CountyHU),
AnimalSales = ifelse(is.na(AnimalSales), medianWithoutNA(AnimalSales), AnimalSales),
AnimalCV = ifelse(is.na(AnimalCV), medianWithoutNA(AnimalCV), AnimalCV),
FruitNutSales = ifelse(is.na(FruitNutSales ), medianWithoutNA(FruitNutSales ), FruitNutSales ),
FruitNutCV = ifelse(is.na(FruitNutCV), medianWithoutNA(FruitNutCV), FruitNutCV),
VeggieSales = ifelse(is.na(VeggieSales), medianWithoutNA(VeggieSales), VeggieSales),
VeggieCV = ifelse(is.na(VeggieCV), medianWithoutNA(VeggieCV), VeggieCV),
CropSales = ifelse(is.na(CropSales), medianWithoutNA(CropSales), CropSales),
CropCV = ifelse(is.na(CropCV ), medianWithoutNA(CropCV ), CropCV ),
AgLandAcres = ifelse(is.na(AgLandAcres), medianWithoutNA(AgLandAcres), AgLandAcres),
AgLandCV = ifelse(is.na(AgLandCV), medianWithoutNA(AgLandCV), AgLandCV)
) %>% ungroup()
Let’s see what is still missing.
Since agriculture is likely limited in DC, it will be assigned the minimum value for each missing agricultural variable.
V5[291,6:15] = as.list(apply(V5[ ,6:15],2, function(x) min(x, na.rm = TRUE)))
Last remaining NA’s.
V5[is.na(V5$Ag.District),1:8 ]
Many Virginia cities are missing the agricultural district name, so those missing will get their own category. Otherwise the rest will be in an “unknown” category.
V5[is.na(V5$Ag.District) & V5$ST == "VA", 5] = "Virginia (Unknown)"
V5[is.na(V5$Ag.District), 5] = "Unknown"
V5 = data.frame(V5)
Let’s check to see if all NA’s are gone and the frequencies add up correctly.
tot_na5 = apply(V5, 2, function(x) length(which(is.na(x))))
tot_na5[tot_na5 > 0 ]
named integer(0)
sum(V5$Freq) ## adds up correctly! 8804 original with 116 dropped
[1] 8688
8804 - 116
[1] 8688
We add row names for presentation by concatenating county nd state names. Then we rename and reorder the columns.
rn = make.names( paste(V5$county, V5$ST, sep = " "), unique = TRUE)
V55 = V5 %>% mutate(RowName = rn) %>% rename(County = county, State = state_name)
V6 = V55 %>% dplyr::select(Fips, RowName, County, State, Ag.District, Freq, CountyArea, CountyHU, AnimalSales:CountyHU)
Let’s make some derived variable looking at the change in votes from 2008 and 2012 for both politcal parties. Also inlude the range from min amd max temeratures as a measure of temperature variability. Additionally, some of the race and weather variables are redundant, so they will be dropped.
V66 = V6 %>% mutate(
DemChange08 = 100*(Democrats.2016 - Democrats.2008)/Democrats.2008,
DemChange12 = 100*(Democrats.2016 - Democrats.2012)/Democrats.2012,
RepChange08 = 100*(Republicans.2016 - Republicans.2008)/Republicans.2012,
RepChange12 = 100*(Republicans.2016 - Republicans.2012)/Republicans.2008,
OtherRace = Other,
TempRange = Annual.Tmax - Annual.Tmin) %>% dplyr::select(-Asian.American.Population,
-Native.American.Population, -Annual.Prcp, -White..Not.Latino..Population, -Latino.Population,
-African.American.Population, -Other.Race.or.Races, -Other )
And we are done!
All_Final_Data_Cleaned = V66 %>% dplyr::select(-ST, -NAME)
rownames(All_Final_Data_Cleaned) = All_Final_Data_Cleaned$RowName
write.csv(All_Final_Data_Cleaned, "Cleaned Data/All_Final_Data_Cleaned.csv")
head(All_Final_Data_Cleaned)
Our final data set contains 3,114 rows (counties) of 126 variables.
Derived Variables:
DemChange08, DemChange12, RepChange08, RepChange12, TempRange, FamilyRatio, MarriedHouseholdRatio, RenterOccupied, AveHousehouldSize, AveFamilySize, TotalFemale, MaleUnder20, Male20to24, Male25to44, Male44to54, Male55to59, MaleOver59, FemaleUnder20, Female20to24, Female25to44, Female44to54, Female55to59, FemaleOver59, HusbandWifeFamilyRatio
Droppped rows IDs from original farmers markets data set.
dropped_IDs
[1] 196 197 217 1158 1183 1313 1605 1618 1885 1917 2290 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304
[25] 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2435 2476 2564
[49] 2696 3330 3593 3695 3730 3972 4019 4044 4057 4058 4179 4925 4926 4927 5236 5240 5269 5286 5296 5333 5569 6595 6876 7009
[73] 7050 7070 7071 7182 7474 7484 7675 7849 7918 8272 8598 1157 1916 2038 2533 2538 2541 2598 2690 2767 2891 2942 3084 3117
[97] 3226 3261 3703 4720 4930 5074 5787 6334 6430 6464 6484 6728 7068 7312 7480 7801 7859 8014 8513 8766
Let’s start by making some aliases to facilitate the coding.
## alias for master tables
D1 = All_Final_Data_Cleaned
D2 = farmers_markets
D3 = FarmersMarketEach
Use tidycensus to upload county geography using the API (requires internet connection and may take a few minutes). I also had geo.shapes file on the election metadata. I think there is a difference in FIPS codes since the 2015 switch between the two files, so I’ll probably end up using the geo variable.
Here is some basic info.
## Pie chart
B1 = data.frame(table(D1$Freq)) %>% rename( Markets = Var1) ## freq tables of # of markets
B2 = B1 %>% mutate(Y = ifelse(B1$Markets != "0" ,1,0)) ## make an "at least one" category
X3 = c("Zero", "At least One") ## name it
Y3 =c()
Y3[1] = sum(B2$Freq[B2$Y == 0]) ## total 0's
Y3[2] = sum(B2$Freq[B2$Y == 1]) ## total "at least one"
B3 = data.frame(X3, Y3 ) %>% rename( Freq = Y3)
pie <- ggplot(B3, aes(x="", y=Freq, fill=X3))+
geom_bar(width = 1, stat = "identity") + coord_polar("y")
PieChart = pie + theme_minimal() + ggtitle( "US Farmers Markets Per County") +
theme(plot.title = element_text( face = "bold")) +
ylab("n = 3,114 counties") + xlab("") +
labs(fill = "") + geom_text(aes(y = Freq/2 + c(0, cumsum(Freq)[-length(Freq)]),
label = paste(round((Freq/sum(B3[ ,2]))*100,2), "%",sep="")), size=3) +
scale_fill_manual(values = c("khaki2", "olivedrab4"))
PieChart
Nearly 3/4 of counties have at least one farmers market.
ggplot(All_Final_Data_Cleaned, aes(x=Freq, )) + geom_histogram( fill = "olivedrab4", binwidth = 1) + labs( x= "Number of Farmers Market per County", y ="Frequency", title = "Distribution of Response Variable") + theme(plot.title = element_text(face = "bold", size = 14))
The distribution of the number of farmers markets per county is skewed right. Most counties have 0-2 farmers markets. There are some outliers, like Los Angeles County with 128 farmers markets (and about 10 million residents).
Since the response variable is a count variable, it is likely to follow either a Poisson or negative binomial distribution, if there is no overdispersion present. A zero inflated model should be investigated, as there may be factors present in some counties that prevent the county from having any farmers markets.
Next let’s see the distribution of markets by state.
B3 = data.frame(table(FarmersMarketEach$State)) %>% rename( State = Var1) %>%
arrange(Freq) %>% filter(State != "Virgin Islands")
BarchartStates = B3 %>%
ggplot(aes(x = reorder(State, Freq), y = Freq)) +
geom_bar(stat="identity", width=0.5, fill = "olivedrab4",
color = "gold") + coord_flip() +
ylab("Number of Farmers Markets by State") + xlab("") +
theme(axis.text.y = element_text(color = "grey20",
size = 6.5, angle = 0, face = "plain"),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold")) +
geom_text(aes(label=Freq), hjust=-.3, color="blue",
position = position_dodge(0.9), size=2.6) +
scale_y_continuous(limits = c(0, 800))
BarchartStates
It is natural to inspect the relationship on the number of farmers markets between county population and median income.
ggplot(All_Final_Data_Cleaned, aes(x = Total.Population, y = Median.Earnings.2010 )) + geom_point( aes(size = Freq, color = Freq)) + guides(size=FALSE) + labs(title = "Population and Income on County Farmers Markets") + geom_label_repel(aes(label=ifelse(Freq>45 ,as.character(RowName),'')),hjust=2,vjust=3, size = 3 ) + scale_x_continuous(labels = comma) + scale_color_distiller( palette = "YlOrBr", direction = 1, name = "Farmers \nMarkets \nFreq.")
There is a moderate positive relatioship between the number of farmers markets (size and color of bubbles) and county population. There is also a weak positive relationship between the numer of farmers markets and county 2010 median income.
ggplot(All_Final_Data_Cleaned, aes(x = AveFamilySize, y = Freq )) + geom_point() + labs(title = "Average Family Size and County Farmers Markets") + geom_label_repel(aes(label=ifelse(Freq>45 ,as.character(RowName),'')),hjust=2,vjust=3, size = 3 ) + scale_x_continuous(labels = comma)
Let’s upload shape files from the Census Bureau.
G1 = left_join(Geography %>% dplyr::select(GEOID, geometry), All_Final_Data_Cleaned, by = c( "GEOID" = "Fips" )) %>% filter(State != "Hawaii") %>% dplyr::select( -GEOID) ## add geometry shape to final data
rownames(G1) = G1$RowName
brks = c(0,1,2,4,8,16,32,128)
tm_view()
$tm_layout
$tm_layout$style
[1] NA
attr(,"class")
[1] "tm"
tm_shape(G1) + tm_fill("Freq", breaks = brks, palette = "YlGn") +
tm_layout(legend.position = c(.87,.25),
legend.text.size = .5,
title.position = c("center","top"),
title = "Number of Farmers Markets by County",
title.fontface = "bold" ,
title.size = 1)
brks2 = c(0,2500,5000,10000,20000,50000,100000, 200000, 500000, 1000000, 3000000, 12000000)
tm_view()
$tm_layout
$tm_layout$style
[1] NA
attr(,"class")
[1] "tm"
tm_shape(G1) + tm_fill("Total.Population", breaks = brks2, palette = "BuGn") +
tm_layout(legend.position = c(.87,.25),
legend.text.size = .5,
title.position = c("right","bottom"),
title = "Population by County",
title.fontface = "bold" ,
title.size = 1)
Add "markets per 1,000 residents’ variable.
G1$MPT = 1000* G1$Freq / G1$Total.Population
tm_view()
$tm_layout
$tm_layout$style
[1] NA
attr(,"class")
[1] "tm"
tm_shape(G1) + tm_fill("MPT", palette = "YlOrBr") +
tm_layout(legend.position = c(.87,.25),
legend.text.size = .5,
title.position = c("center","top"),
title = "Farmers Markets Per 1,000 Residents",
title.fontface = "bold" ,
title.size = 1)
PCA is used to investigate the relationship between the individuals and variables.
Create a new data frame for PCA use. PCA will use Ag.District and State as supplemental qualitative variables, and Freq as a supplemental quantitiative variable. All other quantitative variables will be used to construct the dimensions.
D1 = All_Final_Data_Cleaned
rownames(D1) = All_Final_Data_Cleaned$RowName
D2 = apply(D1[ ,c( 4:5)], 2, as.factor) %>% data.frame()
D3 = apply(D1[ ,c(6:ncol(D1))], 2, as.numeric) %>% data.frame()
D4 = cbind(D2, D3)
D4$State = as.factor(D4$State) # complete unstandarized data set
D4$Ag.District = as.factor(D4$Ag.District) # complete unstandarized data set
str(D4[ ,1:10])
'data.frame': 3114 obs. of 10 variables:
$ State : Factor w/ 50 levels "Alabama","Arizona",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Ag.District : Factor w/ 85 levels "ALL COUNTIES",..: 2 11 85 76 32 2 11 32 76 32 ...
$ Freq : num 1 4 4 1 1 0 2 2 1 1 ...
$ CountyArea : num 6.26e+09 3.01e+10 7.89e+09 1.03e+10 7.32e+09 ...
$ CountyHU : num 72845 606886 39456 54553 104094 ...
$ AnimalSales : num 8.92e+06 1.88e+07 9.35e+07 1.95e+06 2.30e+08 ...
$ AnimalCV : num 12.7 12.7 12.7 12.7 12.7 12.7 12.7 12.7 12.7 12.7 ...
$ CropSales : num 1.25e+07 1.02e+08 1.21e+07 2.25e+06 1.32e+07 ...
$ CropCV : num 16.3 16.3 16.3 16.3 16.3 16.3 16.3 16.3 16.3 16.3 ...
$ FruitNutSales: num 185500 2986000 435000 113000 167000 ...
apply(D4, 2, function(x) sum(is.na(x)))[apply(D4, 2, function(x) sum(is.na(x))) > 0] ##how many NA's > 0
named integer(0)
For the PCA, we assume a linear reationship between the variables. Variables are normalized by scale.unit = TRUE option. Frequency is a supplemental quanititative variable that is not used in the construction of the dimensions.
fm.pca = PCA(D4, scale.unit = TRUE, ncp = 30, quanti.sup = 3, quali.sup = 1:2, graph = TRUE)
Here is the scree plot.
barplot(fm.pca$eig[,2][1:30], main = "scree plot",
ylab = "percent of total variance",
xlab = "first 30 principle inertia", col = "skyblue")
There are noticible drops in the variation between dimensions 5 and 6 and also dimensions 13 and 14.
plot(1:30, fm.pca$eig[1:30 ,3], pch =20, xlab = "Dim. #", ylab = "Cumulative % of Variance")
lines(1:30, fm.pca$eig[1:30 ,3])
abline( h = 80, lty =2, col = "skyblue")
p1 = fviz_pca_biplot(fm.pca, axes = 1:2, select.ind = list(contrib = 50),
select.var = list(contrib = 30),
gradient.cols = "lightgreen", col.ind ="forestgreen",
repel = TRUE, labelsize = 2, col.var = "orange3" , col.ind.sup = "skyblue",
fill.ind = "olivedrab4",
col.quanti.sup = "red" , fill.var = "orange3", pointsize = .7 ) + theme_minimal() +
ggplot2::annotate("text", x = 12, y = 1, label = "Hotter", col = "red") +
ggplot2::annotate("text", x = -12, y = 1, label = "Colder", col ="blue") +
ggplot2::annotate("text", x = 0, y = 25, label = "Politcally Left", col = "blue") +
ggplot2::annotate("text", x = 0, y = -15, label = "Politcally Right", col ="red")
fviz_add(p1, data.frame(fm.pca2$quali.sup$coord[1:50, ], fm.pca2$quali.sup$coord[1:50, ]), repel = TRUE, labelsize = 2.3, pointsize = .7 , color = "deepskyblue3")
# Contributions of variables to PC1
fviz_contrib(fm.pca, choice = "var", axes = 1, top = 30)
# Contributions of variables to PC2
fviz_contrib(fm.pca, choice = "var", axes = 2, top = 30)
The first dimension consists primarily of temperature data - warmer climates are on the positive side of the x-axis while colder climates are on the negative x-axis.
The second dimesnion seperates politcally left leaning counties on the positive side of the y-axis, with politically right leaning counties on the negative y-axis. Most of the “blue states” are above the x-axis, while “red states” are below the x-axis.
Variables, states, and/or counties near each other on the biplot between the first and second dimensions can be interprested as similar with respect to temperature and politcal leaning. For example, Los Angeles County, located near the top of the plot, has a large y-value, implying it is very left leaning politcally. It’s moderate positiion on the x-axis indicates the temperatures are moderately above average.
p2 = fviz_pca_biplot(fm.pca, axes = 3:4, select.ind = list(contrib = 50),
select.var = list(contrib = 30),
gradient.cols = "lightgreen", col.ind ="forestgreen",
repel = TRUE, labelsize = 2, col.var = "orange3" , col.ind.sup = "skyblue",
fill.ind = "olivedrab4",
col.quanti.sup = "red" , fill.var = "orange3", pointsize = .7 ) + theme_minimal() +
ggplot2::annotate("text", x = 9, y = 1, label = "Large Families", col = "red") +
ggplot2::annotate("text", x = -9, y = 1, label = "Small Families", col ="blue") +
ggplot2::annotate("text", x = 0, y = 10, label = "Wetter", col = "blue") +
ggplot2::annotate("text", x = 0, y = -20, label = "Drier", col ="red")
fviz_add(p2, data.frame(fm.pca2$quali.sup$coord[1:50, ], fm.pca2$quali.sup$coord[1:50, ]), repel = TRUE, labelsize = 2.3, pointsize = .7 , color = "deepskyblue3")
# Contributions of variables to PC3
fviz_contrib(fm.pca, choice = "var", axes = 3, top = 30)
# Contributions of variables to PC4
fviz_contrib(fm.pca, choice = "var", axes = 4, top = 30)
The thrid dimension largely seperates family size. The fourth dimesnsion seperates precipation.
p3 = fviz_pca_biplot(fm.pca, axes = 5:6, select.ind = list(contrib = 50),
select.var = list(contrib = 30),
gradient.cols = "lightgreen", col.ind ="forestgreen",
repel = TRUE, labelsize = 2, col.var = "orange3" , col.ind.sup = "skyblue",
fill.ind = "olivedrab4",
col.quanti.sup = "red" , fill.var = "orange3", pointsize = .7 ) + theme_minimal() +
ggplot2::annotate("text", x = 15, y = 1, label = "Large Families", col = "red") +
ggplot2::annotate("text", x = -15, y = 1, label = "Small Families", col ="blue") +
ggplot2::annotate("text", x = 0, y = 20, label = "More Diverse", col = "blue") +
ggplot2::annotate("text", x = 0, y = -15, label = "Less Diverse", col ="red")
fviz_add(p3, data.frame(fm.pca2$quali.sup$coord[1:50, ], fm.pca2$quali.sup$coord[1:50, ]), repel = TRUE, labelsize = 2.3, pointsize = .7 , color = "deepskyblue3")
# Contributions of variables to PC5
fviz_contrib(fm.pca, choice = "var", axes = 5, top = 30)
# Contributions of variables to PC6
fviz_contrib(fm.pca, choice = "var", axes = 6, top = 30)
All_Data_PCA = data.frame(cbind(fm.pca$ind$coord ,All_Final_Data_Cleaned$Freq)) %>% dplyr::select(V31, Dim.1:Dim.30) %>% rename(Freq = V31)
write.csv(All_Data_PCA, "All_Data_PCA.csv")
Goal- identify variables that best seperate county profiles
Choose 7 clusters (Drop in inertia gain). Choose best variables to represent each cluster.
fm.hcpc<-HCPC(fm.pca ,nb.clust=7 ,graph=TRUE, description = TRUE)
print("########## CLUSTER 1 ##########")
[1] "########## CLUSTER 1 ##########"
round(fm.hcpc$desc.var$quanti[[1]][ ,c(1,2,3,6)],3)
v.test Mean in category Overall mean p.value
elevation 26.763 9.629030e+02 3.990520e+02 0.000
Male55to59 26.585 8.500000e-02 7.000000e-02 0.000
MaleOver59 25.239 2.660000e-01 2.040000e-01 0.000
Median.Age 23.284 4.520300e+01 3.990300e+01 0.000
MarriedHouseholdRatio 22.177 8.370000e-01 7.630000e-01 0.000
Female55to59 21.935 8.100000e-02 7.000000e-02 0.000
lat 21.824 4.322800e+01 3.825200e+01 0.000
FemaleOver59 20.643 2.900000e-01 2.390000e-01 0.000
TempRange 19.594 2.576630e+02 2.297520e+02 0.000
AgLandAcres 19.230 2.120111e+05 7.235322e+04 0.000
Farming.fishing.and.forestry.occupations 18.347 4.330000e+00 2.110000e+00 0.000
At.Least.High.School.Diploma 15.141 8.832300e+01 8.300900e+01 0.000
Libertarians.2016 15.031 4.229000e+00 3.163000e+00 0.000
Sire.Homogeneity 14.733 8.460000e-01 7.190000e-01 0.000
Female44to54 14.430 1.590000e-01 1.490000e-01 0.000
CropCV 14.381 2.217100e+01 1.774700e+01 0.000
S 13.912 2.610000e-01 -3.460000e-01 0.000
Management.professional.and.related.occupations 13.899 3.398400e+01 2.982700e+01 0.000
White 13.364 9.110900e+01 7.903500e+01 0.000
White..Asian 12.900 9.149100e+01 8.010600e+01 0.000
Male44to54 12.570 1.600000e-01 1.500000e-01 0.000
Republicans.2012 12.428 6.822600e+01 5.964100e+01 0.000
Republicans.2016 12.403 7.264300e+01 6.359700e+01 0.000
CA 11.285 5.300000e-01 1.000000e-02 0.000
Republicans.2008 11.075 6.395000e+01 5.678500e+01 0.000
Injury.deaths 8.147 8.439300e+01 7.575400e+01 0.000
MAR 5.961 8.040000e-01 7.350000e-01 0.000
CFS 5.496 0.000000e+00 0.000000e+00 0.000
CountyArea 4.620 1.814377e+10 1.200905e+10 0.000
Max.Alc 3.711 1.000000e-03 1.000000e-03 0.000
Green.2016 3.543 9.490000e-01 8.500000e-01 0.000
AgLandCV 2.186 2.523400e+01 2.381600e+01 0.029
Construction.extraction.maintenance.and.repair.occupations 2.041 1.184000e+01 1.151900e+01 0.041
At.Least.Bachelors.s.Degree 2.011 1.980700e+01 1.899500e+01 0.044
AnimalCV 1.982 1.321300e+01 1.256700e+01 0.048
CountyHU -2.255 2.303762e+04 6.297451e+05 0.024
ACFS -2.281 0.000000e+00 0.000000e+00 0.023
OtherRace -2.478 1.374000e+00 1.582000e+00 0.013
Service.occupations -2.940 1.696900e+01 1.744900e+01 0.003
VeggieCV -3.150 1.619800e+01 1.838500e+01 0.002
HusbandWifeFamilyRatio -3.187 2.720000e-01 2.790000e-01 0.001
Adults.65.and.Older.Living.in.Poverty -4.411 1.040200e+01 1.152600e+01 0.000
Median.Earnings.2010 -5.036 2.425283e+04 2.543765e+04 0.000
Precincts -5.268 1.240200e+01 5.489400e+01 0.000
Hispanic -5.312 4.716000e+00 7.940000e+00 0.000
Graduate.Degree -5.420 5.470000e+00 6.445000e+00 0.000
Total.Population -5.966 1.075573e+04 9.775404e+04 0.000
Preschool.Enrollment.Ratio.enrolled.ages.3.and.4 -6.028 3.892700e+01 4.298400e+01 0.000
Freq -6.279 1.050000e+00 2.790000e+00 0.000
Asian -6.543 3.830000e-01 1.071000e+00 0.000
RenterOccupied -6.860 2.520000e-01 2.770000e-01 0.000
Votes -6.890 5.287850e+03 4.174833e+04 0.000
Gini.Coefficient -7.206 4.190000e-01 4.320000e-01 0.000
TotalFemale -7.403 4.930000e-01 5.010000e-01 0.000
FruitNutCV -8.090 1.754400e+01 2.496300e+01 0.000
HIV.prevalence.rate -8.129 7.211000e+01 1.491930e+02 0.000
Children.Under.6.Living.in.Poverty -9.330 1.969800e+01 2.487600e+01 0.000
Child.Poverty.living.in.families.below.the.poverty.line -10.187 1.660000e+01 2.114500e+01 0.000
Poverty.Rate.below.federal.poverty.threshold -10.741 1.228200e+01 1.547800e+01 0.000
Adult.obesity -11.299 2.830000e-01 3.060000e-01 0.000
Sexually.transmitted.infections -11.305 2.089210e+02 3.433310e+02 0.000
Low.birthweight -11.587 7.200000e-02 8.300000e-02 0.000
Violent.crime -12.094 1.365740e+02 2.514570e+02 0.000
FamilyRatio -12.226 6.490000e-01 6.780000e-01 0.000
Adult.smoking -12.281 1.770000e-01 2.110000e-01 0.000
Democrats.2008 -12.370 3.355300e+01 4.156800e+01 0.000
Black -12.545 3.750000e-01 8.830000e+00 0.000
Diabetes -12.984 9.400000e-02 1.070000e-01 0.000
FemaleUnder20 -13.154 2.330000e-01 2.540000e-01 0.000
Summer.Tmax -13.181 8.266970e+02 8.571330e+02 0.000
Democrats.2012 -13.607 2.911000e+01 3.851200e+01 0.000
Male20to24 -13.833 4.500000e-02 6.300000e-02 0.000
Production.transportation.and.material.moving.occupations -13.848 1.238000e+01 1.625200e+01 0.000
Teen.births -13.897 3.120900e+01 4.408800e+01 0.000
MaleUnder20 -14.393 2.440000e-01 2.700000e-01 0.000
Sales.and.office.occupations -14.523 2.049500e+01 2.284300e+01 0.000
Poor.physical.health.days -14.614 3.059000e+00 3.807000e+00 0.000
DemChange12 -14.782 -2.879300e+01 -1.946300e+01 0.000
Democrats.2016 -14.848 2.106500e+01 3.169000e+01 0.000
Poor.mental.health.days -15.165 2.858000e+00 3.536000e+00 0.000
Female20to24 -15.192 3.900000e-02 5.700000e-02 0.000
Less.Than.High.School.Diploma -15.654 1.155200e+01 1.691100e+01 0.000
AveFamilySize -16.364 2.803000e+00 2.923000e+00 0.000
Children.in.single.parent.households -16.585 2.360000e-01 3.160000e-01 0.000
Unemployment -16.860 5.500000e-02 7.700000e-02 0.000
DemChange08 -17.476 -3.919300e+01 -2.615100e+01 0.000
Winter.Tmax -18.035 3.526570e+02 4.490200e+02 0.000
Annual.Tmax -18.662 5.934230e+02 6.619730e+02 0.000
Spring.Tmax -19.405 5.853120e+02 6.591830e+02 0.000
AveHousehouldSize -20.053 2.284000e+00 2.479000e+00 0.000
Male25to44 -20.642 2.070000e-01 2.430000e-01 0.000
Autumn.Tmax -20.769 6.062780e+02 6.799320e+02 0.000
Summer.Prcp -20.796 7.284220e+02 1.108740e+03 0.000
Winter.Tavg -20.883 2.415120e+02 3.451100e+02 0.000
Summer.Tavg -21.830 6.833990e+02 7.396110e+02 0.000
lon -22.407 -1.040800e+02 -9.176500e+01 0.000
Spring.Tavg -22.621 4.559260e+02 5.392730e+02 0.000
Annual.Tavg -23.129 4.643930e+02 5.472120e+02 0.000
temp -23.129 8.022000e+00 1.262300e+01 0.000
Winter.Prcp -23.547 2.500120e+02 8.071530e+02 0.000
Winter.Tmin -23.555 1.290690e+02 2.404230e+02 0.000
Female25to44 -23.615 1.970000e-01 2.310000e-01 0.000
Autumn.Tavg -25.089 4.740800e+02 5.615880e+02 0.000
Spring.Tmin -25.212 3.258320e+02 4.187250e+02 0.000
Summer.Tmin -26.287 5.413290e+02 6.224800e+02 0.000
Annual.Tmin -26.458 3.357600e+02 4.322210e+02 0.000
Spring.Prcp -27.327 5.734290e+02 1.020002e+03 0.000
Autumn.Tmin -28.053 3.419450e+02 4.431530e+02 0.000
precip -29.366 5.069650e+02 9.825080e+02 0.000
Autumn.Prcp -30.453 4.426820e+02 9.303860e+02 0.000
print(c("########## CLUSTER 2 ##########"))
[1] "########## CLUSTER 2 ##########"
round(fm.hcpc$desc.var$quanti[[2]][ ,c(1,2,3,6)],3)
v.test Mean in category Overall mean p.value
At.Least.Bachelors.s.Degree 32.624 33.993 18.995 0.000
Graduate.Degree 29.911 12.569 6.445 0.000
Median.Earnings.2010 28.148 32975.998 25437.655 0.000
Management.professional.and.related.occupations 25.390 38.472 29.827 0.000
S 24.771 0.885 -0.346 0.000
HusbandWifeFamilyRatio 21.833 0.338 0.279 0.000
At.Least.High.School.Diploma 19.594 90.838 83.009 0.000
DemChange12 18.277 -6.331 -19.463 0.000
CA 17.245 0.915 0.010 0.000
DemChange08 16.772 -11.902 -26.151 0.000
Democrats.2016 16.049 44.763 31.690 0.000
School.Enrollment 15.881 79.375 74.986 0.000
Female25to44 15.574 0.255 0.231 0.000
Asian 15.505 2.926 1.071 0.000
Freq 14.812 7.462 2.790 0.000
Sales.and.office.occupations 13.967 25.415 22.843 0.000
Green.2016 12.924 1.260 0.850 0.000
Votes 12.403 116467.440 41748.328 0.000
Libertarians.2016 12.202 4.148 3.163 0.000
Democrats.2008 12.065 50.467 41.568 0.000
ACFS 11.866 0.000 0.000 0.000
Democrats.2012 11.813 47.804 38.512 0.000
lat 10.946 41.093 38.252 0.000
Preschool.Enrollment.Ratio.enrolled.ages.3.and.4 10.166 50.770 42.984 0.000
AveFamilySize 10.138 3.007 2.923 0.000
Male25to44 9.268 0.261 0.243 0.000
Total.Population 8.490 238689.182 97754.035 0.000
Mean.Alc 8.015 0.000 0.000 0.000
Female44to54 7.813 0.155 0.149 0.000
MarriedHouseholdRatio 7.330 0.791 0.763 0.000
Precincts 7.237 121.346 54.894 0.000
AveHousehouldSize 6.956 2.556 2.479 0.000
Female20to24 6.915 0.066 0.057 0.000
lon 6.200 -87.886 -91.765 0.000
Male20to24 5.902 0.072 0.063 0.000
White..Asian 5.614 85.747 80.106 0.000
Mixedness 5.568 0.240 -0.010 0.000
MaleUnder20 5.048 0.280 0.270 0.000
Male44to54 4.730 0.154 0.150 0.000
TotalFemale 4.530 0.506 0.501 0.000
FemaleUnder20 3.940 0.261 0.254 0.000
White 3.680 82.820 79.035 0.000
Autumn.Prcp 3.405 992.461 930.386 0.001
OtherRace 3.295 1.896 1.582 0.001
Max.Alc 2.626 0.001 0.001 0.009
Hispanic -2.087 6.498 7.940 0.037
Amerindian -3.371 0.410 1.541 0.001
Violent.crime -3.540 213.182 251.457 0.000
Black -4.410 5.447 8.830 0.000
AgLandAcres -4.633 34051.368 72353.224 0.000
Sexually.transmitted.infections -4.663 280.219 343.331 0.000
FruitNutCV -4.739 20.016 24.963 0.000
Gini.Coefficient -4.988 0.422 0.432 0.000
Female55to59 -5.016 0.067 0.070 0.000
CropCV -5.549 15.803 17.747 0.000
Male55to59 -5.610 0.066 0.070 0.000
MAR -5.735 0.658 0.735 0.000
Autumn.Tmin -7.495 412.369 443.153 0.000
Winter.Tmin -7.867 198.085 240.423 0.000
Unemployment -8.009 0.065 0.077 0.000
TempRange -8.079 216.651 229.752 0.000
Summer.Tmin -8.328 593.212 622.480 0.000
RepChange08 -8.525 1.930 11.582 0.000
Annual.Tmin -8.557 396.708 432.221 0.000
Winter.Tavg -9.206 293.124 345.110 0.000
Median.Age -9.295 37.494 39.903 0.000
Spring.Tmin -9.449 379.092 418.725 0.000
Autumn.Tavg -9.459 524.032 561.588 0.000
Service.occupations -9.690 15.649 17.449 0.000
Low.birthweight -9.969 0.072 0.083 0.000
temp -10.173 10.319 12.623 0.000
Annual.Tavg -10.173 505.744 547.212 0.000
Winter.Tmax -10.227 386.817 449.020 0.000
Poor.mental.health.days -10.258 3.014 3.536 0.000
Farming.fishing.and.forestry.occupations -10.665 0.642 2.110 0.000
Summer.Tavg -10.698 708.250 739.611 0.000
Spring.Tavg -11.001 493.130 539.273 0.000
Autumn.Tmax -11.228 634.603 679.932 0.000
Annual.Tmax -11.626 613.359 661.973 0.000
Children.in.single.parent.households -11.917 0.250 0.316 0.000
Republicans.2008 -12.016 47.937 56.785 0.000
Republicans.2012 -12.024 50.185 59.641 0.000
RepChange12 -12.056 -2.645 7.502 0.000
Spring.Tmax -12.239 606.145 659.183 0.000
Summer.Tmax -12.459 824.384 857.133 0.000
Poor.physical.health.days -14.152 2.983 3.807 0.000
Construction.extraction.maintenance.and.repair.occupations -14.299 8.959 11.519 0.000
MaleOver59 -14.849 0.162 0.204 0.000
FemaleOver59 -15.952 0.195 0.239 0.000
Adults.65.and.Older.Living.in.Poverty -16.752 6.667 11.526 0.000
Adult.smoking -16.799 0.158 0.211 0.000
Production.transportation.and.material.moving.occupations -16.922 10.865 16.252 0.000
Republicans.2016 -17.618 48.968 63.597 0.000
Adult.obesity -17.865 0.265 0.306 0.000
Poverty.Rate.below.federal.poverty.threshold -19.152 8.990 15.478 0.000
Children.Under.6.Living.in.Poverty -19.254 12.711 24.876 0.000
CFS -19.428 0.000 0.000 0.000
Diabetes -19.644 0.084 0.107 0.000
Less.Than.High.School.Diploma -19.883 9.162 16.911 0.000
Child.Poverty.living.in.families.below.the.poverty.line -20.694 10.635 21.145 0.000
Teen.births -20.834 22.109 44.088 0.000
Injury.deaths -20.949 50.468 75.754 0.000
Uninsured -20.981 0.119 0.179 0.000
print(c("########## CLUSTER 3 ##########"))
[1] "########## CLUSTER 3 ##########"
round(fm.hcpc$desc.var$quanti[[3]][ ,c(1,2,3,6)],3)
v.test Mean in category Overall mean p.value
Sire.Homogeneity 26.019 8.600000e-01 7.190000e-01 0.000
White 23.354 9.224400e+01 7.903500e+01 0.000
White..Asian 22.972 9.279900e+01 8.010600e+01 0.000
lat 20.721 4.121000e+01 3.825200e+01 0.000
RepChange12 18.509 1.607000e+01 7.502000e+00 0.000
Production.transportation.and.material.moving.occupations 16.718 1.917900e+01 1.625200e+01 0.000
RepChange08 16.366 2.177200e+01 1.158200e+01 0.000
At.Least.High.School.Diploma 13.246 8.592000e+01 8.300900e+01 0.000
S 12.612 -1.000000e-03 -3.460000e-01 0.000
lon 11.612 -8.777000e+01 -9.176500e+01 0.000
Male44to54 11.589 1.560000e-01 1.500000e-01 0.000
Median.Age 11.559 4.155000e+01 3.990300e+01 0.000
MarriedHouseholdRatio 11.361 7.870000e-01 7.630000e-01 0.000
FemaleOver59 10.793 2.560000e-01 2.390000e-01 0.000
CA 10.440 3.110000e-01 1.000000e-02 0.000
Libertarians.2016 9.635 3.591000e+00 3.163000e+00 0.000
Male55to59 9.459 7.300000e-02 7.000000e-02 0.000
Female44to54 8.889 1.530000e-01 1.490000e-01 0.000
Democrats.2008 8.377 4.496600e+01 4.156800e+01 0.000
MaleOver59 8.365 2.170000e-01 2.040000e-01 0.000
Summer.Prcp 7.950 1.199760e+03 1.108740e+03 0.000
Green.2016 6.900 9.700000e-01 8.500000e-01 0.000
Female55to59 6.223 7.200000e-02 7.000000e-02 0.000
CFS 5.792 0.000000e+00 0.000000e+00 0.000
Democrats.2012 5.156 4.074300e+01 3.851200e+01 0.000
Autumn.Prcp 4.964 9.801540e+02 9.303860e+02 0.000
Spring.Prcp 4.923 1.070365e+03 1.020002e+03 0.000
AgLandCV 4.612 2.569000e+01 2.381600e+01 0.000
Adult.smoking 4.470 2.190000e-01 2.110000e-01 0.000
Republicans.2016 3.774 6.532000e+01 6.359700e+01 0.000
Adult.obesity 3.283 3.100000e-01 3.060000e-01 0.001
School.Enrollment 2.584 7.537800e+01 7.498600e+01 0.010
precip 2.077 1.003565e+03 9.825080e+02 0.038
VeggieSales -2.139 2.536536e+06 6.327674e+06 0.032
AnimalCV -2.178 1.212300e+01 1.256700e+01 0.029
FruitNutSales -2.544 1.887515e+06 9.235039e+06 0.011
Freq -2.705 2.321000e+00 2.790000e+00 0.007
Diabetes -2.912 1.050000e-01 1.070000e-01 0.004
CountyHU -3.312 7.186757e+04 6.297451e+05 0.001
HusbandWifeFamilyRatio -3.849 2.730000e-01 2.790000e-01 0.000
FamilyRatio -3.859 6.720000e-01 6.780000e-01 0.000
Construction.extraction.maintenance.and.repair.occupations -4.072 1.111800e+01 1.151900e+01 0.000
Precincts -4.138 3.399900e+01 5.489400e+01 0.000
Amerindian -4.375 7.340000e-01 1.541000e+00 0.000
Poor.mental.health.days -4.405 3.413000e+00 3.536000e+00 0.000
Democrats.2016 -4.699 2.958500e+01 3.169000e+01 0.000
OtherRace -5.061 1.317000e+00 1.582000e+00 0.000
AveFamilySize -5.550 2.898000e+00 2.923000e+00 0.000
Republicans.2012 -5.702 5.717500e+01 5.964100e+01 0.000
Total.Population -5.864 4.422103e+04 9.775404e+04 0.000
Poor.physical.health.days -5.880 3.619000e+00 3.807000e+00 0.000
Female20to24 -6.137 5.300000e-02 5.700000e-02 0.000
Male20to24 -6.400 5.800000e-02 6.300000e-02 0.000
Votes -6.606 1.986192e+04 4.174833e+04 0.000
CountyArea -6.842 6.320820e+09 1.200905e+10 0.000
Graduate.Degree -6.984 5.659000e+00 6.445000e+00 0.000
Winter.Prcp -7.208 7.003780e+02 8.071530e+02 0.000
Male25to44 -7.286 2.350000e-01 2.430000e-01 0.000
Children.Under.6.Living.in.Poverty -7.731 2.219000e+01 2.487600e+01 0.000
Asian -7.839 5.550000e-01 1.071000e+00 0.000
elevation -7.891 2.949680e+02 3.990520e+02 0.000
Children.in.single.parent.households -8.015 2.920000e-01 3.160000e-01 0.000
Injury.deaths -8.078 7.039200e+01 7.575400e+01 0.000
Farming.fishing.and.forestry.occupations -8.138 1.494000e+00 2.110000e+00 0.000
MaleUnder20 -8.175 2.610000e-01 2.700000e-01 0.000
At.Least.Bachelors.s.Degree -8.607 1.681900e+01 1.899500e+01 0.000
Republicans.2008 -8.987 5.314600e+01 5.678500e+01 0.000
FemaleUnder20 -9.544 2.450000e-01 2.540000e-01 0.000
Female25to44 -9.707 2.220000e-01 2.310000e-01 0.000
Management.professional.and.related.occupations -9.856 2.798100e+01 2.982700e+01 0.000
FruitNutCV -10.173 1.912200e+01 2.496300e+01 0.000
Child.Poverty.living.in.families.below.the.poverty.line -10.364 1.825000e+01 2.114500e+01 0.000
AveHousehouldSize -10.550 2.415000e+00 2.479000e+00 0.000
HIV.prevalence.rate -10.769 8.525900e+01 1.491930e+02 0.000
AgLandAcres -10.995 2.236444e+04 7.235322e+04 0.000
Summer.Tmin -11.837 5.996020e+02 6.224800e+02 0.000
Violent.crime -12.178 1.790380e+02 2.514570e+02 0.000
ACFS -12.647 0.000000e+00 0.000000e+00 0.000
Poverty.Rate.below.federal.poverty.threshold -12.767 1.309900e+01 1.547800e+01 0.000
Adults.65.and.Older.Living.in.Poverty -13.014 9.450000e+00 1.152600e+01 0.000
RenterOccupied -13.144 2.470000e-01 2.770000e-01 0.000
Less.Than.High.School.Diploma -13.206 1.408000e+01 1.691100e+01 0.000
Hispanic -13.435 2.836000e+00 7.940000e+00 0.000
Max.Alc -13.453 0.000000e+00 1.000000e-03 0.000
CropCV -13.506 1.514500e+01 1.774700e+01 0.000
Teen.births -14.516 3.566600e+01 4.408800e+01 0.000
Autumn.Tmin -14.914 4.094680e+02 4.431530e+02 0.000
Black -15.440 2.316000e+00 8.830000e+00 0.000
Sexually.transmitted.infections -15.554 2.275570e+02 3.433310e+02 0.000
Low.birthweight -15.982 7.300000e-02 8.300000e-02 0.000
Mean.Alc -16.261 0.000000e+00 0.000000e+00 0.000
Gini.Coefficient -16.451 4.140000e-01 4.320000e-01 0.000
Annual.Tmin -17.279 3.927830e+02 4.322210e+02 0.000
Summer.Tavg -17.917 7.107270e+02 7.396110e+02 0.000
Spring.Tmin -18.158 3.768410e+02 4.187250e+02 0.000
Mixedness -18.431 -4.650000e-01 -1.000000e-02 0.000
MAR -18.645 5.980000e-01 7.350000e-01 0.000
DemChange08 -18.752 -3.491200e+01 -2.615100e+01 0.000
Autumn.Tavg -19.166 5.197380e+02 5.615880e+02 0.000
TempRange -19.182 2.126460e+02 2.297520e+02 0.000
Winter.Tmin -19.921 1.814670e+02 2.404230e+02 0.000
Spring.Tavg -21.130 4.905320e+02 5.392730e+02 0.000
temp -21.194 9.983000e+00 1.262300e+01 0.000
Annual.Tavg -21.194 4.997030e+02 5.472120e+02 0.000
DemChange12 -21.296 -2.787800e+01 -1.946300e+01 0.000
Uninsured -22.637 1.430000e-01 1.790000e-01 0.000
Autumn.Tmax -22.781 6.293550e+02 6.799320e+02 0.000
Winter.Tavg -23.188 2.730960e+02 3.451100e+02 0.000
Spring.Tmax -23.593 6.029570e+02 6.591830e+02 0.000
Summer.Tmax -23.833 8.226800e+02 8.571330e+02 0.000
Annual.Tmax -24.588 6.054300e+02 6.619730e+02 0.000
Winter.Tmax -25.445 3.639070e+02 4.490200e+02 0.000
print(c("########## CLUSTER 4 ##########"))
[1] "########## CLUSTER 4 ##########"
round(fm.hcpc$desc.var$quanti[[4]][ ,c(1,2,3,6)],3)
v.test Mean in category Overall mean p.value
Hispanic 28.895 3.336800e+01 7.940000e+00 0.000
FemaleUnder20 27.376 3.160000e-01 2.540000e-01 0.000
AveHousehouldSize 24.650 2.827000e+00 2.479000e+00 0.000
TempRange 22.294 2.757970e+02 2.297520e+02 0.000
AveFamilySize 22.232 3.159000e+00 2.923000e+00 0.000
elevation 22.208 1.077480e+03 3.990520e+02 0.000
AgLandAcres 21.717 3.010351e+05 7.235322e+04 0.000
MaleUnder20 20.934 3.230000e-01 2.700000e-01 0.000
HusbandWifeFamilyRatio 15.247 3.320000e-01 2.790000e-01 0.000
Uninsured 15.075 2.340000e-01 1.790000e-01 0.000
AnimalSales 14.905 1.962571e+08 6.295049e+07 0.000
Farming.fishing.and.forestry.occupations 14.637 4.678000e+00 2.110000e+00 0.000
Teen.births 13.589 6.234900e+01 4.408800e+01 0.000
FamilyRatio 13.249 7.230000e-01 6.780000e-01 0.000
Amerindian 13.061 7.126000e+00 1.541000e+00 0.000
CropSales 12.439 2.000772e+08 6.220057e+07 0.000
FruitNutSales 12.230 9.105897e+07 9.235039e+06 0.000
CountyArea 12.213 3.552485e+10 1.200905e+10 0.000
CropCV 11.102 2.269900e+01 1.774700e+01 0.000
VeggieSales 10.324 4.870766e+07 6.327674e+06 0.000
CFS 10.091 0.000000e+00 0.000000e+00 0.000
Less.Than.High.School.Diploma 10.056 2.190200e+01 1.691100e+01 0.000
MAR 9.487 8.950000e-01 7.350000e-01 0.000
DemChange12 8.949 -1.127300e+01 -1.946300e+01 0.000
Libertarians.2016 8.710 4.059000e+00 3.163000e+00 0.000
RenterOccupied 7.518 3.150000e-01 2.770000e-01 0.000
Summer.Tmax 7.227 8.813320e+02 8.571330e+02 0.000
Poverty.Rate.below.federal.poverty.threshold 5.968 1.805300e+01 1.547800e+01 0.000
DemChange08 5.869 -1.980000e+01 -2.615100e+01 0.000
Female25to44 5.792 2.420000e-01 2.310000e-01 0.000
Mixedness 5.591 3.100000e-01 -1.000000e-02 0.000
Construction.extraction.maintenance.and.repair.occupations 4.782 1.260900e+01 1.151900e+01 0.000
Child.Poverty.living.in.families.below.the.poverty.line 4.629 2.413900e+01 2.114500e+01 0.000
Male25to44 4.628 2.540000e-01 2.430000e-01 0.000
Male20to24 3.936 7.100000e-02 6.300000e-02 0.000
Female20to24 3.698 6.300000e-02 5.700000e-02 0.000
ACFS 3.531 0.000000e+00 0.000000e+00 0.000
Max.Alc 3.469 1.000000e-03 1.000000e-03 0.001
Republicans.2008 3.399 5.997400e+01 5.678500e+01 0.001
Injury.deaths 3.346 8.089900e+01 7.575400e+01 0.001
Service.occupations 3.203 1.820700e+01 1.744900e+01 0.001
Children.Under.6.Living.in.Poverty 3.058 2.733800e+01 2.487600e+01 0.002
Annual.Tmax 3.050 6.782190e+02 6.619730e+02 0.002
Sexually.transmitted.infections 3.028 3.955290e+02 3.433310e+02 0.002
Winter.Tmax 2.632 4.694070e+02 4.490200e+02 0.009
Republicans.2012 2.602 6.224700e+01 5.964100e+01 0.009
Green.2016 2.599 9.550000e-01 8.500000e-01 0.009
Spring.Tmax 2.567 6.733500e+02 6.591830e+02 0.010
FruitNutCV 2.409 2.816700e+01 2.496300e+01 0.016
Adults.65.and.Older.Living.in.Poverty 2.059 1.228600e+01 1.152600e+01 0.039
Winter.Tmin -2.044 2.264120e+02 2.404230e+02 0.041
Votes -2.074 2.583265e+04 4.174833e+04 0.038
Summer.Tavg -2.112 7.317270e+02 7.396110e+02 0.035
Freq -2.357 1.843000e+00 2.790000e+00 0.018
MarriedHouseholdRatio -2.709 7.490000e-01 7.630000e-01 0.007
Republicans.2016 -2.889 6.054100e+01 6.359700e+01 0.004
Democrats.2012 -2.927 3.557900e+01 3.851200e+01 0.003
Autumn.Tavg -3.006 5.463860e+02 5.615880e+02 0.003
Gini.Coefficient -3.074 4.240000e-01 4.320000e-01 0.002
Management.professional.and.related.occupations -3.158 2.845700e+01 2.982700e+01 0.002
AgLandCV -3.524 2.050000e+01 2.381600e+01 0.000
Democrats.2008 -3.580 3.820400e+01 4.156800e+01 0.000
Low.birthweight -3.647 7.800000e-02 8.300000e-02 0.000
Sales.and.office.occupations -3.796 2.195300e+01 2.284300e+01 0.000
S -3.845 -5.890000e-01 -3.460000e-01 0.000
At.Least.Bachelors.s.Degree -3.978 1.666600e+01 1.899500e+01 0.000
HIV.prevalence.rate -4.236 9.095400e+01 1.491930e+02 0.000
Graduate.Degree -4.605 5.244000e+00 6.445000e+00 0.000
Adult.obesity -4.679 2.920000e-01 3.060000e-01 0.000
Median.Earnings.2010 -4.836 2.378801e+04 2.543765e+04 0.000
Spring.Tmin -5.102 3.914700e+02 4.187250e+02 0.000
Production.transportation.and.material.moving.occupations -5.330 1.409100e+01 1.625200e+01 0.000
Preschool.Enrollment.Ratio.enrolled.ages.3.and.4 -5.586 3.753300e+01 4.298400e+01 0.000
Annual.Tmin -5.637 4.024220e+02 4.322210e+02 0.000
Adult.smoking -5.696 1.880000e-01 2.110000e-01 0.000
School.Enrollment -6.185 7.280800e+01 7.498600e+01 0.000
RepChange08 -6.772 1.816000e+00 1.158200e+01 0.000
Autumn.Tmin -7.105 4.059840e+02 4.431530e+02 0.000
TotalFemale -7.209 4.900000e-01 5.010000e-01 0.000
Black -7.303 1.693000e+00 8.830000e+00 0.000
CA -8.030 -5.270000e-01 1.000000e-02 0.000
RepChange12 -8.418 -1.523000e+00 7.502000e+00 0.000
Summer.Tmin -8.895 5.826660e+02 6.224800e+02 0.000
At.Least.High.School.Diploma -9.651 7.809800e+01 8.300900e+01 0.000
Diabetes -10.160 9.200000e-02 1.070000e-01 0.000
MaleOver59 -10.200 1.680000e-01 2.040000e-01 0.000
AnimalCV -10.453 7.628000e+00 1.256700e+01 0.000
Winter.Prcp -13.716 3.365840e+02 8.071530e+02 0.000
Male55to59 -13.825 5.900000e-02 7.000000e-02 0.000
Sire.Homogeneity -14.058 5.440000e-01 7.190000e-01 0.000
Female55to59 -14.666 6.000000e-02 7.000000e-02 0.000
FemaleOver59 -14.920 1.860000e-01 2.390000e-01 0.000
Female44to54 -16.369 1.320000e-01 1.490000e-01 0.000
Male44to54 -16.528 1.320000e-01 1.500000e-01 0.000
Median.Age -18.339 3.385000e+01 3.990300e+01 0.000
White -18.347 5.500200e+01 7.903500e+01 0.000
White..Asian -18.738 5.612700e+01 8.010600e+01 0.000
lon -20.212 -1.078720e+02 -9.176500e+01 0.000
Summer.Prcp -22.673 5.075190e+02 1.108740e+03 0.000
Autumn.Prcp -23.144 3.929670e+02 9.303860e+02 0.000
precip -23.751 4.248130e+02 9.825080e+02 0.000
Spring.Prcp -24.678 4.352590e+02 1.020002e+03 0.000
print(c("######### CLUSTER 5 ##########"))
[1] "######### CLUSTER 5 ##########"
round(fm.hcpc$desc.var$quanti[[5]][ ,c(1,2,3,6)],3)
v.test Mean in category Overall mean p.value
RenterOccupied 28.206 4.420000e-01 2.770000e-01 0.000
Votes 27.637 2.830042e+05 4.174833e+04 0.000
Total.Population 26.123 7.260978e+05 9.775404e+04 0.000
Asian 25.549 5.501000e+00 1.071000e+00 0.000
Female20to24 24.410 1.040000e-01 5.700000e-02 0.000
Freq 24.238 1.386900e+01 2.790000e+00 0.000
Precincts 24.212 3.770380e+02 5.489400e+01 0.000
Male20to24 23.380 1.130000e-01 6.300000e-02 0.000
Democrats.2016 22.855 5.866700e+01 3.169000e+01 0.000
Graduate.Degree 22.263 1.305100e+01 6.445000e+00 0.000
At.Least.Bachelors.s.Degree 20.543 3.268000e+01 1.899500e+01 0.000
HIV.prevalence.rate 19.595 4.557140e+02 1.491930e+02 0.000
CountyHU 19.267 9.181546e+06 6.297451e+05 0.000
Democrats.2012 18.468 5.956300e+01 3.851200e+01 0.000
Democrats.2008 18.175 6.099400e+01 4.156800e+01 0.000
DemChange08 17.971 -4.028000e+00 -2.615100e+01 0.000
DemChange12 17.256 -1.496000e+00 -1.946300e+01 0.000
Violent.crime 16.210 5.054570e+02 2.514570e+02 0.000
MAR 15.206 1.028000e+00 7.350000e-01 0.000
Female25to44 14.972 2.650000e-01 2.310000e-01 0.000
Management.professional.and.related.occupations 14.825 3.714200e+01 2.982700e+01 0.000
Sexually.transmitted.infections 13.740 6.128050e+02 3.433310e+02 0.000
Gini.Coefficient 13.288 4.690000e-01 4.320000e-01 0.000
OtherRace 12.733 3.342000e+00 1.582000e+00 0.000
Male25to44 12.227 2.770000e-01 2.430000e-01 0.000
CFS 11.987 0.000000e+00 0.000000e+00 0.000
Green.2016 10.847 1.349000e+00 8.500000e-01 0.000
CountyArea 10.780 3.562413e+10 1.200905e+10 0.000
Mixedness 9.552 6.120000e-01 -1.000000e-02 0.000
Sales.and.office.occupations 9.300 2.532400e+01 2.284300e+01 0.000
Winter.Tmin 8.579 3.073270e+02 2.404230e+02 0.000
Black 8.224 1.797400e+01 8.830000e+00 0.000
Autumn.Tmin 7.490 4.877250e+02 4.431530e+02 0.000
Winter.Tavg 7.398 4.056490e+02 3.451100e+02 0.000
Annual.Tmin 7.035 4.745310e+02 4.322210e+02 0.000
Max.Alc 6.986 1.000000e-03 1.000000e-03 0.000
Children.in.single.parent.households 6.928 3.710000e-01 3.160000e-01 0.000
Preschool.Enrollment.Ratio.enrolled.ages.3.and.4 6.819 5.055200e+01 4.298400e+01 0.000
Hispanic 6.763 1.471100e+01 7.940000e+00 0.000
ACFS 6.744 0.000000e+00 0.000000e+00 0.000
Spring.Tmin 6.385 4.575320e+02 4.187250e+02 0.000
Autumn.Tavg 6.300 5.978330e+02 5.615880e+02 0.000
Winter.Tmax 6.176 5.034570e+02 4.490200e+02 0.000
temp 5.839 1.453900e+01 1.262300e+01 0.000
Annual.Tavg 5.839 5.817040e+02 5.472120e+02 0.000
School.Enrollment 5.681 7.726100e+01 7.498600e+01 0.000
TotalFemale 5.475 5.100000e-01 5.010000e-01 0.000
Autumn.Tmax 5.088 7.096940e+02 6.799320e+02 0.000
Spring.Tavg 5.030 5.698450e+02 5.392730e+02 0.000
Annual.Tmax 4.761 6.908190e+02 6.619730e+02 0.000
Winter.Prcp 4.587 9.862030e+02 8.071530e+02 0.000
At.Least.High.School.Diploma 4.541 8.563800e+01 8.300900e+01 0.000
Service.occupations 4.459 1.865000e+01 1.744900e+01 0.000
Median.Earnings.2010 4.041 2.700585e+04 2.543765e+04 0.000
Summer.Tmin 3.998 6.428390e+02 6.224800e+02 0.000
Spring.Tmax 3.983 6.841960e+02 6.591830e+02 0.000
Summer.Tavg 3.354 7.538580e+02 7.396110e+02 0.001
CropSales 2.879 9.849885e+07 6.220057e+07 0.004
Libertarians.2016 2.852 3.497000e+00 3.163000e+00 0.004
Mean.Alc 2.813 0.000000e+00 0.000000e+00 0.005
FruitNutSales 2.806 3.059535e+07 9.235039e+06 0.005
Poverty.Rate.below.federal.poverty.threshold 2.735 1.682100e+01 1.547800e+01 0.006
S 2.637 -1.560000e-01 -3.460000e-01 0.008
VeggieSales 2.628 1.860103e+07 6.327674e+06 0.009
precip 2.300 1.043938e+03 9.825080e+02 0.021
Summer.Tmax 2.177 8.654270e+02 8.571330e+02 0.029
Unemployment 2.072 8.100000e-02 7.700000e-02 0.038
Autumn.Prcp 2.042 9.843310e+02 9.303860e+02 0.041
Amerindian -2.009 5.640000e-01 1.541000e+00 0.045
AgLandAcres -2.237 4.555692e+04 7.235322e+04 0.025
Children.Under.6.Living.in.Poverty -2.285 2.278400e+01 2.487600e+01 0.022
AnimalCV -2.528 1.120800e+01 1.256700e+01 0.011
FruitNutCV -2.633 2.097900e+01 2.496300e+01 0.008
AnimalSales -2.972 3.271525e+07 6.295049e+07 0.003
Poor.physical.health.days -4.212 3.452000e+00 3.807000e+00 0.000
Adults.65.and.Older.Living.in.Poverty -4.261 9.735000e+00 1.152600e+01 0.000
CA -4.370 -3.220000e-01 1.000000e-02 0.000
CropCV -4.483 1.547200e+01 1.774700e+01 0.000
elevation -4.491 2.429790e+02 3.990520e+02 0.000
lat -5.338 3.624500e+01 3.825200e+01 0.000
Less.Than.High.School.Diploma -5.620 1.373700e+01 1.691100e+01 0.000
TempRange -5.730 2.162880e+02 2.297520e+02 0.000
Teen.births -6.099 3.476400e+01 4.408800e+01 0.000
Farming.fishing.and.forestry.occupations -7.792 5.550000e-01 2.110000e+00 0.000
Adult.smoking -8.741 1.710000e-01 2.110000e-01 0.000
Diabetes -10.221 9.000000e-02 1.070000e-01 0.000
RepChange08 -10.256 -5.244000e+00 1.158200e+01 0.000
Injury.deaths -10.651 5.712500e+01 7.575400e+01 0.000
Adult.obesity -10.683 2.710000e-01 3.060000e-01 0.000
Female55to59 -11.273 6.100000e-02 7.000000e-02 0.000
White..Asian -11.468 6.341000e+01 8.010600e+01 0.000
RepChange12 -11.749 -6.829000e+00 7.502000e+00 0.000
Male55to59 -11.860 5.900000e-02 7.000000e-02 0.000
MarriedHouseholdRatio -12.719 6.920000e-01 7.630000e-01 0.000
Construction.extraction.maintenance.and.repair.occupations -12.913 8.169000e+00 1.151900e+01 0.000
Production.transportation.and.material.moving.occupations -13.187 1.016900e+01 1.625200e+01 0.000
Female44to54 -13.601 1.330000e-01 1.490000e-01 0.000
FemaleOver59 -13.747 1.840000e-01 2.390000e-01 0.000
White -14.176 5.790900e+01 7.903500e+01 0.000
Male44to54 -14.386 1.330000e-01 1.500000e-01 0.000
MaleOver59 -15.417 1.410000e-01 2.040000e-01 0.000
Median.Age -16.413 3.373900e+01 3.990300e+01 0.000
Sire.Homogeneity -17.438 4.720000e-01 7.190000e-01 0.000
Republicans.2008 -17.876 3.771100e+01 5.678500e+01 0.000
Republicans.2012 -18.557 3.849300e+01 5.964100e+01 0.000
FamilyRatio -19.865 6.010000e-01 6.780000e-01 0.000
Republicans.2016 -22.839 3.611700e+01 6.359700e+01 0.000
print(c("########## CLUSTER 6 ##########"))
[1] "########## CLUSTER 6 ##########"
round(fm.hcpc$desc.var$quanti[[6]][ ,c(1,2,3,6)],3)
v.test Mean in category Overall mean p.value
Spring.Tavg 31.187 6.079800e+02 5.392730e+02 0.000
temp 30.844 1.629200e+01 1.262300e+01 0.000
Annual.Tavg 30.844 6.132490e+02 5.472120e+02 0.000
Spring.Tmax 30.754 7.291810e+02 6.591830e+02 0.000
Spring.Tmin 30.623 4.861890e+02 4.187250e+02 0.000
Autumn.Tavg 30.566 6.253350e+02 5.615880e+02 0.000
Autumn.Tmax 30.484 7.445730e+02 6.799320e+02 0.000
Annual.Tmax 30.349 7.286310e+02 6.619730e+02 0.000
Annual.Tmin 30.218 4.980930e+02 4.322210e+02 0.000
Summer.Tavg 29.731 7.853870e+02 7.396110e+02 0.000
Winter.Tmax 29.443 5.430840e+02 4.490200e+02 0.000
Autumn.Tmin 29.302 5.063620e+02 4.431530e+02 0.000
Winter.Tavg 29.142 4.315530e+02 3.451100e+02 0.000
Summer.Tmin 28.886 6.757990e+02 6.224800e+02 0.000
Winter.Tmin 27.934 3.193820e+02 2.404230e+02 0.000
Summer.Tmax 27.160 8.946330e+02 8.571330e+02 0.000
Republicans.2008 24.986 6.645000e+01 5.678500e+01 0.000
Republicans.2012 23.444 6.932400e+01 5.964100e+01 0.000
Poor.physical.health.days 22.617 4.499000e+00 3.807000e+00 0.000
Republicans.2016 22.604 7.345500e+01 6.359700e+01 0.000
Uninsured 22.228 2.130000e-01 1.790000e-01 0.000
precip 21.797 1.193568e+03 9.825080e+02 0.000
Less.Than.High.School.Diploma 21.296 2.127000e+01 1.691100e+01 0.000
Diabetes 20.943 1.200000e-01 1.070000e-01 0.000
Spring.Prcp 20.710 1.222366e+03 1.020002e+03 0.000
Adult.smoking 20.396 2.440000e-01 2.110000e-01 0.000
Teen.births 20.311 5.534300e+01 4.408800e+01 0.000
Autumn.Prcp 20.205 1.123863e+03 9.303860e+02 0.000
Poor.mental.health.days 20.100 4.073000e+00 3.536000e+00 0.000
Winter.Prcp 19.847 1.087941e+03 8.071530e+02 0.000
Construction.extraction.maintenance.and.repair.occupations 17.267 1.314200e+01 1.151900e+01 0.000
FruitNutCV 17.046 3.431100e+01 2.496300e+01 0.000
Injury.deaths 15.526 8.559700e+01 7.575400e+01 0.000
FamilyRatio 13.944 6.980000e-01 6.780000e-01 0.000
Summer.Prcp 13.931 1.261076e+03 1.108740e+03 0.000
Adults.65.and.Older.Living.in.Poverty 12.189 1.338200e+01 1.152600e+01 0.000
Child.Poverty.living.in.families.below.the.poverty.line 11.933 2.432800e+01 2.114500e+01 0.000
Low.birthweight 11.644 9.000000e-02 8.300000e-02 0.000
Poverty.Rate.below.federal.poverty.threshold 11.212 1.747300e+01 1.547800e+01 0.000
Adult.obesity 11.050 3.190000e-01 3.060000e-01 0.000
Children.Under.6.Living.in.Poverty 10.824 2.846900e+01 2.487600e+01 0.000
Gini.Coefficient 8.871 4.410000e-01 4.320000e-01 0.000
Production.transportation.and.material.moving.occupations 8.792 1.772200e+01 1.625200e+01 0.000
lon 7.945 -8.915400e+01 -9.176500e+01 0.000
CFS 5.789 0.000000e+00 0.000000e+00 0.000
Children.in.single.parent.households 5.445 3.310000e-01 3.160000e-01 0.000
Female25to44 4.606 2.340000e-01 2.310000e-01 0.000
Male25to44 4.542 2.470000e-01 2.430000e-01 0.000
Unemployment 3.885 8.000000e-02 7.700000e-02 0.000
AveHousehouldSize 3.017 2.497000e+00 2.479000e+00 0.003
Violent.crime 2.992 2.684520e+02 2.514570e+02 0.003
AnimalCV 2.813 1.311500e+01 1.256700e+01 0.005
FemaleOver59 2.573 2.430000e-01 2.390000e-01 0.010
Black 2.349 9.777000e+00 8.830000e+00 0.019
Sales.and.office.occupations 2.127 2.304900e+01 2.284300e+01 0.033
RepChange12 -2.050 6.596000e+00 7.502000e+00 0.040
HIV.prevalence.rate -2.112 1.372190e+02 1.491930e+02 0.035
RepChange08 -2.114 1.032600e+01 1.158200e+01 0.035
VeggieSales -2.526 2.051556e+06 6.327674e+06 0.012
FruitNutSales -2.889 1.263334e+06 9.235039e+06 0.004
CountyHU -3.179 1.182256e+05 6.297451e+05 0.001
Service.occupations -3.433 1.711400e+01 1.744900e+01 0.001
AnimalSales -3.638 4.953442e+07 6.295049e+07 0.000
Female20to24 -3.652 5.500000e-02 5.700000e-02 0.000
VeggieCV -3.670 1.686200e+01 1.838500e+01 0.000
MarriedHouseholdRatio -3.709 7.550000e-01 7.630000e-01 0.000
Male20to24 -3.722 6.000000e-02 6.300000e-02 0.000
Female55to59 -3.926 6.900000e-02 7.000000e-02 0.000
Male44to54 -3.927 1.480000e-01 1.500000e-01 0.000
DemChange08 -4.031 -2.795000e+01 -2.615100e+01 0.000
Farming.fishing.and.forestry.occupations -4.120 1.812000e+00 2.110000e+00 0.000
MAR -4.322 7.040000e-01 7.350000e-01 0.000
HusbandWifeFamilyRatio -4.921 2.720000e-01 2.790000e-01 0.000
CountyArea -5.093 7.964887e+09 1.200905e+10 0.000
Female44to54 -5.407 1.470000e-01 1.490000e-01 0.000
Total.Population -5.740 4.770734e+04 9.775404e+04 0.000
Preschool.Enrollment.Ratio.enrolled.ages.3.and.4 -5.775 4.066000e+01 4.298400e+01 0.000
RenterOccupied -6.222 2.630000e-01 2.770000e-01 0.000
CA -6.279 -1.630000e-01 1.000000e-02 0.000
Votes -6.478 2.125039e+04 4.174833e+04 0.000
Precincts -6.567 2.322400e+01 5.489400e+01 0.000
AgLandCV -6.656 2.123400e+01 2.381600e+01 0.000
Median.Earnings.2010 -7.201 2.442473e+04 2.543765e+04 0.000
Sire.Homogeneity -7.687 6.800000e-01 7.190000e-01 0.000
Asian -7.703 5.860000e-01 1.071000e+00 0.000
CropSales -8.496 2.336845e+07 6.220057e+07 0.000
Male55to59 -8.657 6.700000e-02 7.000000e-02 0.000
School.Enrollment -8.944 7.368700e+01 7.498600e+01 0.000
Max.Alc -9.029 0.000000e+00 1.000000e-03 0.000
Mean.Alc -9.041 0.000000e+00 0.000000e+00 0.000
Freq -9.655 1.190000e+00 2.790000e+00 0.000
AgLandAcres -9.787 2.985184e+04 7.235322e+04 0.000
Mixedness -9.954 -2.450000e-01 -1.000000e-02 0.000
Graduate.Degree -11.976 5.157000e+00 6.445000e+00 0.000
ACFS -12.224 0.000000e+00 0.000000e+00 0.000
elevation -13.596 2.277840e+02 3.990520e+02 0.000
Management.professional.and.related.occupations -14.912 2.716000e+01 2.982700e+01 0.000
At.Least.Bachelors.s.Degree -15.428 1.527000e+01 1.899500e+01 0.000
Green.2016 -18.331 5.440000e-01 8.500000e-01 0.000
S -18.541 -8.300000e-01 -3.460000e-01 0.000
Democrats.2016 -19.196 2.347700e+01 3.169000e+01 0.000
At.Least.High.School.Diploma -20.913 7.862000e+01 8.300900e+01 0.000
Libertarians.2016 -21.186 2.266000e+00 3.163000e+00 0.000
Democrats.2012 -22.361 2.927300e+01 3.851200e+01 0.000
Democrats.2008 -24.069 3.224300e+01 4.156800e+01 0.000
lat -29.582 3.422000e+01 3.825200e+01 0.000
print(c("########## CLUSTER 7 ##########"))
[1] "########## CLUSTER 7 ##########"
round(fm.hcpc$desc.var$quanti[[7]][ ,c(1,2,3,6)],3)
v.test Mean in category Overall mean p.value
Black 40.152 4.241200e+01 8.830000e+00 0.000
Low.birthweight 32.580 1.220000e-01 8.300000e-02 0.000
Sexually.transmitted.infections 31.655 8.103660e+02 3.433310e+02 0.000
Children.in.single.parent.households 31.445 5.040000e-01 3.160000e-01 0.000
Mean.Alc 28.384 0.000000e+00 0.000000e+00 0.000
Child.Poverty.living.in.families.below.the.poverty.line 28.275 3.679900e+01 2.114500e+01 0.000
Poverty.Rate.below.federal.poverty.threshold 27.998 2.581700e+01 1.547800e+01 0.000
Adults.65.and.Older.Living.in.Poverty 25.633 1.962900e+01 1.152600e+01 0.000
Children.Under.6.Living.in.Poverty 25.586 4.250000e+01 2.487600e+01 0.000
Mixedness 25.233 1.226000e+00 -1.000000e-02 0.000
Diabetes 24.221 1.390000e-01 1.070000e-01 0.000
Adult.obesity 22.028 3.600000e-01 3.060000e-01 0.000
Less.Than.High.School.Diploma 22.009 2.626100e+01 1.691100e+01 0.000
Teen.births 21.879 6.924900e+01 4.408800e+01 0.000
ACFS 21.254 0.000000e+00 0.000000e+00 0.000
Democrats.2016 21.230 5.054100e+01 3.169000e+01 0.000
Max.Alc 20.057 1.000000e-03 1.000000e-03 0.000
HIV.prevalence.rate 19.650 3.804260e+02 1.491930e+02 0.000
Unemployment 19.641 1.080000e-01 7.700000e-02 0.000
DemChange08 19.616 -7.984000e+00 -2.615100e+01 0.000
Democrats.2012 19.450 5.519100e+01 3.851200e+01 0.000
Spring.Tavg 19.001 6.261530e+02 5.392730e+02 0.000
Spring.Tmin 18.839 5.048650e+02 4.187250e+02 0.000
Annual.Tavg 18.811 6.308000e+02 5.472120e+02 0.000
temp 18.811 1.726700e+01 1.262300e+01 0.000
Autumn.Tavg 18.617 6.421680e+02 5.615880e+02 0.000
Annual.Tmin 18.584 5.163000e+02 4.322210e+02 0.000
Spring.Tmax 18.477 7.464640e+02 6.591830e+02 0.000
Violent.crime 18.200 4.659950e+02 2.514570e+02 0.000
Autumn.Tmax 18.191 7.599860e+02 6.799320e+02 0.000
Winter.Tavg 18.187 4.570710e+02 3.451100e+02 0.000
Annual.Tmax 18.127 7.446010e+02 6.619730e+02 0.000
Autumn.Tmin 18.010 5.237820e+02 4.431530e+02 0.000
Winter.Tmax 17.955 5.680710e+02 4.490200e+02 0.000
Winter.Tmin 17.897 3.454130e+02 2.404230e+02 0.000
Gini.Coefficient 17.875 4.700000e-01 4.320000e-01 0.000
Summer.Tavg 16.976 7.938590e+02 7.396110e+02 0.000
Summer.Tmin 16.922 6.873070e+02 6.224800e+02 0.000
Democrats.2008 16.181 5.457800e+01 4.156800e+01 0.000
Winter.Prcp 15.673 1.267344e+03 8.071530e+02 0.000
MAR 15.270 9.560000e-01 7.350000e-01 0.000
Summer.Tmax 14.829 8.996250e+02 8.571330e+02 0.000
precip 14.356 1.271000e+03 9.825080e+02 0.000
DemChange12 13.342 -9.013000e+00 -1.946300e+01 0.000
Service.occupations 12.211 1.992200e+01 1.744900e+01 0.000
Spring.Prcp 11.048 1.244058e+03 1.020002e+03 0.000
Autumn.Prcp 10.752 1.144061e+03 9.303860e+02 0.000
Summer.Prcp 10.381 1.344324e+03 1.108740e+03 0.000
Uninsured 10.358 2.120000e-01 1.790000e-01 0.000
VeggieCV 9.614 2.666700e+01 1.838500e+01 0.000
Production.transportation.and.material.moving.occupations 8.995 1.937300e+01 1.625200e+01 0.000
RenterOccupied 8.359 3.140000e-01 2.770000e-01 0.000
Poor.physical.health.days 8.271 4.332000e+00 3.807000e+00 0.000
lon 7.413 -8.670900e+01 -9.176500e+01 0.000
AnimalCV 7.347 1.553800e+01 1.256700e+01 0.000
Preschool.Enrollment.Ratio.enrolled.ages.3.and.4 7.296 4.907600e+01 4.298400e+01 0.000
MaleUnder20 6.673 2.840000e-01 2.700000e-01 0.000
AveHousehouldSize 6.307 2.555000e+00 2.479000e+00 0.000
Poor.mental.health.days 6.058 3.872000e+00 3.536000e+00 0.000
Injury.deaths 5.956 8.359100e+01 7.575400e+01 0.000
FemaleUnder20 5.731 2.650000e-01 2.540000e-01 0.000
Male25to44 5.092 2.540000e-01 2.430000e-01 0.000
Adult.smoking 4.499 2.260000e-01 2.110000e-01 0.000
Male20to24 4.425 7.000000e-02 6.300000e-02 0.000
CropCV 3.869 1.922400e+01 1.774700e+01 0.000
Female20to24 3.825 6.300000e-02 5.700000e-02 0.000
TotalFemale 3.304 5.050000e-01 5.010000e-01 0.001
FruitNutCV 3.225 2.863400e+01 2.496300e+01 0.001
RepChange12 3.195 1.043400e+01 7.502000e+00 0.001
Farming.fishing.and.forestry.occupations 2.439 2.477000e+00 2.110000e+00 0.015
Female25to44 2.405 2.350000e-01 2.310000e-01 0.016
AgLandCV 2.246 2.562400e+01 2.381600e+01 0.025
AgLandAcres -1.988 5.443904e+04 7.235322e+04 0.047
CropSales -2.229 4.106047e+07 6.220057e+07 0.026
Female55to59 -2.449 6.900000e-02 7.000000e-02 0.014
Construction.extraction.maintenance.and.repair.occupations -2.554 1.102100e+01 1.151900e+01 0.011
Precincts -3.120 2.366900e+01 5.489400e+01 0.002
Total.Population -3.201 3.982949e+04 9.775404e+04 0.001
Sales.and.office.occupations -3.263 2.218900e+01 2.284300e+01 0.001
AnimalSales -3.317 3.755941e+07 6.295049e+07 0.001
CountyArea -3.572 6.123291e+09 1.200905e+10 0.000
AveFamilySize -3.918 2.887000e+00 2.923000e+00 0.000
Votes -4.071 1.501061e+04 4.174833e+04 0.000
Asian -4.387 4.980000e-01 1.071000e+00 0.000
FemaleOver59 -4.711 2.250000e-01 2.390000e-01 0.000
Male55to59 -5.091 6.700000e-02 7.000000e-02 0.000
School.Enrollment -5.156 7.343200e+01 7.498600e+01 0.000
Freq -5.516 8.930000e-01 2.790000e+00 0.000
Female44to54 -5.823 1.440000e-01 1.490000e-01 0.000
OtherRace -5.976 9.610000e-01 1.582000e+00 0.000
Male44to54 -6.206 1.440000e-01 1.500000e-01 0.000
Median.Age -8.293 3.756000e+01 3.990300e+01 0.000
MaleOver59 -8.678 1.770000e-01 2.040000e-01 0.000
Graduate.Degree -8.720 4.499000e+00 6.445000e+00 0.000
Green.2016 -10.192 4.970000e-01 8.500000e-01 0.000
elevation -11.287 1.039550e+02 3.990520e+02 0.000
At.Least.Bachelors.s.Degree -11.522 1.322100e+01 1.899500e+01 0.000
Median.Earnings.2010 -12.275 2.185399e+04 2.543765e+04 0.000
Management.professional.and.related.occupations -12.957 2.501700e+01 2.982700e+01 0.000
Republicans.2008 -15.287 4.451300e+01 5.678500e+01 0.000
Republicans.2016 -17.769 4.751400e+01 6.359700e+01 0.000
lat -17.972 3.316800e+01 3.825200e+01 0.000
Republicans.2012 -18.352 4.390800e+01 5.964100e+01 0.000
Sire.Homogeneity -19.627 5.100000e-01 7.190000e-01 0.000
HusbandWifeFamilyRatio -20.457 2.190000e-01 2.790000e-01 0.000
At.Least.High.School.Diploma -21.282 7.373900e+01 8.300900e+01 0.000
Libertarians.2016 -21.856 1.241000e+00 3.163000e+00 0.000
CFS -22.359 0.000000e+00 0.000000e+00 0.000
CA -27.782 -1.579000e+00 1.000000e-02 0.000
White -29.278 4.621200e+01 7.903500e+01 0.000
White..Asian -30.492 4.671000e+01 8.010600e+01 0.000
S -31.795 -2.068000e+00 -3.460000e-01 0.000
MarriedHouseholdRatio -33.784 6.220000e-01 7.630000e-01 0.000
CLUSTER 1: Older male population, higher elevation, farming, conservative, large area, small houshold size, married (retired??)
CLUSTER 2: educated, married, more Trump over Obama, females 44-59, insured, healthy
CLUSTER 3: high income, educated, liberal, married, high ave. household size, employed, healthy
CLUSTER 4: hot, poor health, construction and transportation jobs, single parents, low income, republican
CLUSTER 5: Hispanic, young, farming counties, lower education levels,
CLUSTER 6: young, renters, educated, democrats, multi-cultural,
CLUSTER 7: democrats, poor health, low education, renters, low family size
The number of farmers markets per county is positively correlated with clusters 3 and 5, and negatively correlated with clusters 1, 2, 4, 5, and 7.
Assign cluster and PCA coordinates to final data set.
All_Final_Data_Cleaned2 = cbind(All_Final_Data_Cleaned[ ,1:2], fm.hcpc$data.clust, fm.pca$ind$coord) %>% mutate_if(is.factor, as.character)
write.csv(All_Final_Data_Cleaned2, "Cleaned Data/All_Final_Data_Cleaned2.csv")
Oh lets map the clusters.
Let’s add the geography to create a shape file.
G2 = left_join(G1, All_Final_Data_Cleaned2, by = "RowName")
tm_view()
$tm_layout
$tm_layout$style
[1] NA
attr(,"class")
[1] "tm"
tm_shape(G2) + tm_fill("clust", palette = "Set1") +
tm_layout(legend.position = c(.87,.25),
legend.text.size = .5,
title.position = c("right","bottom"),
title = "Cluster",
title.fontface = "bold" ,
title.size = 1)
All_Final_Data_Cleaned2 %>% ggplot(aes(x = clust, y = Freq, color = clust)) + geom_point(aes(color = clust)) +
There were 28 warnings (use warnings() to see them)
scale_fill_brewer(palette="Set1") + labs(title="Farmers Markets by Cluster", face = "bold",x="Cluster", y = "Length") + theme_minimal()
In this section, we will first explore whether certain variables, whose domains have been deemed signiificant in the past, have a significant effect on the frequency of farmers markets per county, while controling for certain variables.
Since the response variable is a count data, we will explore Poisson, quasipoisson, and negative binomials. Zero inflated models will be investigated but the algorithms tend to fail to converge as the models became slightly more complex. Each variable or combination of variables will be tested in each of the aforementioned models, and the model with the lowest AIC will be selected. AIC is preferred for this study since it is penalized by the addition of more parameters. The selected model will be tested using a goodness of fit statistic.
A data frame of each model’s predicted value, as well the chi-squared contribution, will be created to ultimately assess which counties deviate most from what is expected under the assumptions of te given model.
We will use a log transformation on the popultion variable.
m1.pois = glm(Freq~ log(Total.Population), data = All_Final_Data_Cleaned2, family = "poisson")
m1.nb = glm.nb(Freq~ log(Total.Population), data = All_Final_Data_Cleaned2)
m1.pois0 = zeroinfl(m1.pois, dist = "poisson")
m1.nb0 = zeroinfl(m1.pois, dist = "negbin")
L1 = list(m1.pois, m1.nb, m1.pois0, m1.nb0)
print(rbind(c("Poisson", "NegBin", "ZeroInflPoisson", "ZeroInflNegBin"),AIC = round(sapply( L1, AIC),2)))
[,1] [,2] [,3] [,4]
"Poisson" "NegBin" "ZeroInflPoisson" "ZeroInflNegBin"
AIC "11273.54" "10464.35" "11276.6" "10468.35"
The negative binimial model is the preferred model.
summary(m1.nb)
Call:
glm.nb(formula = Freq ~ log(Total.Population), data = All_Final_Data_Cleaned2,
init.theta = 4.563667704, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7401 -0.9777 -0.2650 0.4209 4.1685
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.77504 0.12238 -55.36 <2e-16 ***
log(Total.Population) 0.70134 0.01085 64.66 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(4.5637) family taken to be 1)
Null deviance: 9110.4 on 3113 degrees of freedom
Residual deviance: 3077.0 on 3112 degrees of freedom
AIC: 10464
Number of Fisher Scoring iterations: 1
Theta: 4.564
Std. Err.: 0.324
2 x log-likelihood: -10458.354
plot(m1.nb)
m1 = m1.nb
Studies suggest that females 44-59 represent the largest demographic of shoppers at farmers markets, citing many are purchasing fresh produce o cook for their family. Thus we lookat average family size, controlling for population.
m2.pois = glm(Freq~ log(Total.Population) + AveFamilySize, data = All_Final_Data_Cleaned2, family = "poisson")
m2.nb = glm.nb(Freq~ log(Total.Population) + AveFamilySize, data = All_Final_Data_Cleaned2)
m2.pois0 = zeroinfl(m2.pois, dist = "poisson")
m2.nb0 = zeroinfl(m2.pois, dist = "negbin")
L2 = list(m2.pois, m2.nb, m2.pois0, m2.nb0)
rbind(c("Poisson", "NegBin", "ZeroInflPoisson", "ZeroInflNegBin"),AIC = round(sapply( L2, AIC),2))
[,1] [,2] [,3] [,4]
"Poisson" "NegBin" "ZeroInflPoisson" "ZeroInflNegBin"
AIC "10988.61" "10299.54" "10990" "10302.16"
Again, the negative binimial model is the preferred model.
summary(m2.nb)
Call:
glm.nb(formula = Freq ~ log(Total.Population) + AveFamilySize,
data = All_Final_Data_Cleaned2, init.theta = 5.25030653,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8040 -0.9503 -0.2584 0.4492 4.0972
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.25374 0.29960 -10.86 <2e-16 ***
log(Total.Population) 0.75422 0.01137 66.32 <2e-16 ***
AveFamilySize -1.39837 0.10969 -12.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(5.2503) family taken to be 1)
Null deviance: 9587.5 on 3113 degrees of freedom
Residual deviance: 3021.8 on 3111 degrees of freedom
AIC: 10300
Number of Fisher Scoring iterations: 1
Theta: 5.250
Std. Err.: 0.394
2 x log-likelihood: -10291.539
plot(m2.nb)
m2 = m2.nb
Now lets see how the coordinates of the PCA pan out. We’ll use the first 18 dimensions from the PCA, enough to account for 80% of variation between the dimensions. We will start by using all dimensions to select a distribution. Then we will eliminate non-significant dimensions one at a time, removing the dimension with the highest p-value.
m3.pois = glm(Freq~ ., data = All_Final_Data_Cleaned2 %>% dplyr::select(Freq, Dim.1:Dim.18), family = "poisson")
m3.nb = glm.nb(Freq~ ., data = All_Final_Data_Cleaned2 %>% dplyr::select(Freq, Dim.1:Dim.18))
alternation limit reached
m3.pois0 = zeroinfl(m3.pois, dist = "poisson")
glm.fit: fitted probabilities numerically 0 or 1 occurred
m3.nb0 = zeroinfl(m3.pois, dist = "negbin")
glm.fit: fitted probabilities numerically 0 or 1 occurred
L3 = list(m3.pois, m3.nb, m3.pois0, m3.nb0)
rbind(c("Poisson", "NegBin", "ZeroInflPoisson", "ZeroInflNegBin"),AIC = round(sapply( L3, AIC),2))
[,1] [,2] [,3] [,4]
"Poisson" "NegBin" "ZeroInflPoisson" "ZeroInflNegBin"
AIC "11425.9" "10402.19" "11244.8" "10341.26"
Again, the negative binimial model is the preferred model.
summary(m3.nb0)
Call:
zeroinfl(formula = m3.pois, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-2.1401 -0.6723 -0.1954 0.4338 19.8183
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.511824 0.020870 24.524 < 2e-16 ***
Dim.1 -0.032506 0.003980 -8.167 3.17e-16 ***
Dim.2 0.186849 0.004112 45.438 < 2e-16 ***
Dim.3 -0.006527 0.006179 -1.056 0.290860
Dim.4 0.138626 0.006366 21.774 < 2e-16 ***
Dim.5 0.013837 0.007325 1.889 0.058881 .
Dim.6 -0.162546 0.008097 -20.074 < 2e-16 ***
Dim.7 0.032420 0.007392 4.386 1.16e-05 ***
Dim.8 0.029352 0.009593 3.060 0.002216 **
Dim.9 -0.034286 0.009765 -3.511 0.000446 ***
Dim.10 0.023687 0.011341 2.089 0.036737 *
Dim.11 0.076589 0.013406 5.713 1.11e-08 ***
Dim.12 0.010834 0.013557 0.799 0.424211
Dim.13 -0.030414 0.014151 -2.149 0.031613 *
Dim.14 0.019759 0.013461 1.468 0.142160
Dim.15 0.039606 0.016165 2.450 0.014279 *
Dim.16 0.031493 0.015454 2.038 0.041570 *
Dim.17 -0.021219 0.015328 -1.384 0.166242
Dim.18 -0.074528 0.014847 -5.020 5.18e-07 ***
Log(theta) 1.554344 0.068730 22.615 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.6404 1.6937 -5.692 1.26e-08 ***
Dim.1 0.2463 0.1176 2.094 0.036233 *
Dim.2 -2.1906 0.4515 -4.852 1.22e-06 ***
Dim.3 0.1929 0.1410 1.368 0.171403
Dim.4 -0.6238 0.1569 -3.977 6.98e-05 ***
Dim.5 -0.6823 0.2397 -2.847 0.004419 **
Dim.6 3.3647 0.6462 5.207 1.92e-07 ***
Dim.7 -2.4141 0.5909 -4.086 4.39e-05 ***
Dim.8 -4.5314 0.9892 -4.581 4.62e-06 ***
Dim.9 -1.9116 0.4908 -3.895 9.84e-05 ***
Dim.10 2.0023 0.5575 3.592 0.000328 ***
Dim.11 -0.4645 0.1846 -2.516 0.011853 *
Dim.12 0.3165 0.2231 1.419 0.155980
Dim.13 0.5409 0.2115 2.557 0.010549 *
Dim.14 -0.1144 0.1773 -0.645 0.518881
Dim.15 0.9210 0.3191 2.886 0.003898 **
Dim.16 0.3528 0.2947 1.197 0.231312
Dim.17 0.6595 0.3428 1.924 0.054370 .
Dim.18 -0.6653 0.3234 -2.057 0.039680 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 4.732
Number of iterations in BFGS optimization: 79
Log-likelihood: -5132 on 39 Df
plot(m3.nb0)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' is a list, but does not have components 'x' and 'y'
We can use backwards elimination to reduce dimesnsion.
m4.nb0 = be.zeroinfl(m3.nb0, All_Final_Data_Cleaned2, dist = "negbin")
glm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m4.nb0)
Call:
zeroinfl(formula = eval(parse(text = out)), data = data, dist = dist)
Pearson residuals:
Min 1Q Median 3Q Max
-2.1336 -0.6998 -0.1868 0.4376 41.7427
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.492689 0.020633 23.879 < 2e-16 ***
Dim.1 -0.038999 0.003597 -10.843 < 2e-16 ***
Dim.2 0.187890 0.003989 47.101 < 2e-16 ***
Dim.4 0.140961 0.006277 22.456 < 2e-16 ***
Dim.5 0.019696 0.007008 2.811 0.004945 **
Dim.6 -0.168274 0.008009 -21.010 < 2e-16 ***
Dim.7 0.025218 0.007330 3.440 0.000581 ***
Dim.8 0.028972 0.009563 3.029 0.002450 **
Dim.9 -0.027213 0.009564 -2.845 0.004435 **
Dim.10 0.022532 0.011012 2.046 0.040748 *
Dim.11 0.083064 0.013081 6.350 2.15e-10 ***
Dim.13 -0.044514 0.013383 -3.326 0.000880 ***
Dim.15 0.055444 0.015776 3.514 0.000441 ***
Dim.16 0.049225 0.014690 3.351 0.000806 ***
Dim.18 -0.075191 0.014575 -5.159 2.48e-07 ***
Log(theta) 1.546323 0.070232 22.017 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.9194 1.7067 -4.640 3.48e-06 ***
Dim.2 -1.1119 0.2571 -4.325 1.53e-05 ***
Dim.4 -0.3994 0.1185 -3.372 0.000747 ***
Dim.6 1.8369 0.3608 5.091 3.56e-07 ***
Dim.7 -1.6013 0.3634 -4.406 1.05e-05 ***
Dim.8 -2.4572 0.5199 -4.727 2.28e-06 ***
Dim.9 -0.8262 0.3069 -2.692 0.007098 **
Dim.10 0.6139 0.2319 2.647 0.008117 **
Dim.11 -0.9988 0.2710 -3.686 0.000228 ***
Dim.15 1.1735 0.3681 3.188 0.001433 **
Dim.16 1.0770 0.3667 2.937 0.003316 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 4.6942
Number of iterations in BFGS optimization: 55
Log-likelihood: -5151 on 27 Df
AIC(m4.nb0)
[1] 10356.31
m4 = m4.nb0
The PCA analysis from before helps shed light on which variables are independent. Furthermore, the previous research into customer profiles helps shed light on which variables give us reasonable information. With this in mind, I set out to find some combination of variables that include measures of county population, political preference, family relations, age distribution, income, agricultural data, health, home ownership demographics, and eduacation. Each variable or variables were selected to reduce AIC in the model while remaining as a significant predictor. Extreme caution should be used in interpreting the results.
We can see the if the correlation between the predictors.
M = All_Final_Data_Cleaned2 %>% dplyr::select(Freq, Total.Population,Green.2016 , Median.Earnings.2010 , FruitNutCV ,Median.Age , RenterOccupied , Poor.physical.health.days ,Annual.Tavg , precip , Graduate.Degree)
corrplot(cor(M))
Although Graduate degree appears to be colinear with other factors, I havve chosen to let it remain in the model since it does tend to lower the AIC in the models I have experimented with and PCA has shown it to be a strong representation in the lower dimensons of the PCA.
m5.pois = glm(Freq~ log(Total.Population) + Green.2016 + Median.Earnings.2010 + FruitNutCV + Median.Age + RenterOccupied + Poor.physical.health.days + Annual.Tavg + precip + Graduate.Degree , data = All_Final_Data_Cleaned2, family = "poisson")
m5.nb = glm.nb(Freq ~ log(Total.Population) + Green.2016 + Median.Earnings.2010 + FruitNutCV + Median.Age + RenterOccupied + Poor.physical.health.days + Annual.Tavg + precip + Graduate.Degree , data = All_Final_Data_Reduced )
L5 = list(m5.pois, m5.nb)
rbind(c("Poisson", "NegBin"),AIC = round(sapply( L5, AIC),2))
[,1] [,2]
"Poisson" "NegBin"
AIC "9600.41" "9447.58"
summary(m5.nb)
Call:
glm.nb(formula = Freq ~ log(Total.Population) + Green.2016 +
Median.Earnings.2010 + FruitNutCV + Median.Age + RenterOccupied +
Poor.physical.health.days + Annual.Tavg + precip + Graduate.Degree,
data = All_Final_Data_Reduced, init.theta = 16.04581749,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7886 -0.9084 -0.1959 0.4780 4.5338
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.762e+00 2.381e-01 -28.398 < 2e-16 ***
log(Total.Population) 7.003e-01 1.284e-02 54.540 < 2e-16 ***
Green.2016 2.233e-01 2.156e-02 10.361 < 2e-16 ***
Median.Earnings.2010 -1.331e-05 3.114e-06 -4.273 1.93e-05 ***
FruitNutCV -5.302e-03 9.122e-04 -5.812 6.17e-09 ***
Median.Age 4.379e-02 3.663e-03 11.955 < 2e-16 ***
RenterOccupied 1.141e+00 2.045e-01 5.580 2.41e-08 ***
Poor.physical.health.days -9.581e-02 1.828e-02 -5.241 1.59e-07 ***
Annual.Tavg -3.326e-03 2.231e-04 -14.908 < 2e-16 ***
precip 1.568e-04 4.533e-05 3.458 0.000544 ***
Graduate.Degree 2.290e-02 3.957e-03 5.786 7.19e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(16.0458) family taken to be 1)
Null deviance: 12964.3 on 3113 degrees of freedom
Residual deviance: 2757.6 on 3103 degrees of freedom
AIC: 9447.6
Number of Fisher Scoring iterations: 1
Theta: 16.05
Std. Err.: 2.03
2 x log-likelihood: -9423.584
plot(m5.nb)
m5 = m5.nb0
Goodness of Fit. Here we test H0: The observed farmers market frequencies follow the given negative binomial model vs. H1: The observed farmers market frequencies do not follow the given negative binomial model using a chi-squared goodness of fit with N - p = 3114 - 11 = 3103 df.
qchisq(.95, 3103)
[1] 3233.706
The 5% critical value is 3233.706, which is greater than the observed deviance of 2757.6. So we fail to reject the null and can conclude that the given negative binomial model does fit the observed data.
Although it is possible to reduce AIC further, this model is can help us make some general interpretations about the frequency
Lastly we can use zero inflated model with backwards selection with the remaining variables to reduce AIC to build a model with the highest predictive power.
m5.nb0 = zeroinfl(m5.nb, data = All_Final_Data_Cleaned2, dist = "negbin")
m6 = be.zeroinfl(m5.nb0,data = All_Final_Data_Cleaned2, dist = "negbin")
NaNs producedNaNs producedNaNs produced
AIC(m5.nb)
[1] 9447.584
AIC(m6)
[1] 9407.916
summary(m6)
NaNs produced
Call:
zeroinfl(formula = eval(parse(text = out)), data = data, dist = dist)
Pearson residuals:
Min 1Q Median 3Q Max
-2.0917 -0.6741 -0.1406 0.4980 8.5273
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.744e+00 2.396e-01 -28.150 < 2e-16 ***
log(Total.Population) 6.883e-01 1.122e-02 61.329 < 2e-16 ***
Green.2016 2.242e-01 2.159e-02 10.384 < 2e-16 ***
Median.Earnings.2010 -1.240e-05 3.546e-08 -349.630 < 2e-16 ***
FruitNutCV -4.187e-03 9.478e-04 -4.417 1.00e-05 ***
Median.Age 4.378e-02 3.666e-03 11.944 < 2e-16 ***
RenterOccupied 1.178e+00 1.954e-01 6.028 1.66e-09 ***
Poor.physical.health.days -1.043e-01 1.825e-02 -5.714 1.11e-08 ***
Annual.Tavg -3.088e-03 2.278e-04 -13.558 < 2e-16 ***
precip 1.407e-04 4.673e-05 3.011 0.0026 **
Graduate.Degree 2.173e-02 3.640e-03 5.970 2.38e-09 ***
Log(theta) 2.786e+00 4.545e-03 612.997 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.300e+00 5.983e+00 -0.384 0.700698
log(Total.Population) -2.403e+00 4.267e-01 -5.633 1.78e-08 ***
Median.Earnings.2010 7.312e-05 NA NA NA
FruitNutCV 6.164e-02 1.616e-02 3.813 0.000137 ***
Poor.physical.health.days -1.570e+00 4.193e-01 -3.745 0.000180 ***
Annual.Tavg 4.691e-02 8.062e-03 5.819 5.93e-09 ***
Graduate.Degree -1.114e+00 2.668e-01 -4.177 2.96e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 16.2177
Number of iterations in BFGS optimization: 69
Log-likelihood: -4685 on 19 Df
We’ll make a data set containing the observed frequncy and the expected frequency under the last four models. Models 1 and 2 are too simple and their AIC is too high for final consideration. We construct the chi-squared component of each observation and model, and record whether the residual is possitive or negative.
Expected = data.frame(cbind(All_Final_Data_Reduced$Freq, m3$fitted.values, m4$fitted.values,
m5$fitted.values, m6$fitted.values, m3$residuals, m4$residuals, m5$residuals, m6$residuals )) %>%
rename(Freq = X1, ExpectedMod3 = X2, ExpectedMod4 = X3, ExpectedMod5 = X4, ExpectedMod6 = X5,
Resid3 = X6, Resid4 = X7, Resid5 = X8, Resid6 = X9) %>% mutate(Component3 = (Resid3^2)/ExpectedMod3,
Component4 = (Resid4^2)/ExpectedMod4,
Component5 = (Resid5^2)/ExpectedMod5,
Component6 = (Resid6^2)/ExpectedMod6) %>%
mutate(Component3 = ifelse(Resid3 < 0, -Component3 , Component3),
Component4 = ifelse(Resid4 < 0, -Component4 , Component4),
Component5 = ifelse(Resid5 < 0, -Component5 , Component5),
Component6 = ifelse(Resid6 < 0, -Component6 , Component6))
rownames(Expected) = rownames(All_Final_Data_Cleaned2)
Expected = round(Expected, 4) %>% arrange(Component6)
head(Expected)
The data has been sorted in descending terms of the “negative” chi squared contributions for model 6, which has the lowest AIC. Models 3 and 4 use the PCA components and the model has strange behavior when handling outliers. Three major outliers were identified, Los Angeles County, CA and Cook County, IL have very large populations, and Shannon County, SD has a very high Native American population. Thus the PCA models should be interpreted with caution, especially when dealing with counties with exetreme observations.
Queens.County.NY, Richmond.County.NY, and Arapahoe.County.CO are the three leading candidates that show promise for holding more farmers markets. Queens has only 19 farmers markets when the models 5 and 6 predict anywhere from about 46 farmers markets. Arapahoe County, CO, containing portions of suburban Denver, has only 4 markets where the models predict somewhere between 9 and 14.